1. 14 December 2014 (4 commits)
    • M
      lib: bitmap: add alignment offset for bitmap_find_next_zero_area() · 5e19b013
      Michal Nazarewicz authored
      Add a bitmap_find_next_zero_area_off() function which works like the
      bitmap_find_next_zero_area() function except that it allows an offset to
      be specified when alignment is checked.  This lets the caller request a
      bit such that its number plus the offset is aligned according to the mask.
      
      [gregory.0xf0@gmail.com: Retrieved from https://patchwork.linuxtv.org/patch/6254/ and updated documentation]
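      For context, the pre-existing helper then reduces to a wrapper with a zero
      offset; a minimal sketch consistent with the description above:
      
        static inline unsigned long
        bitmap_find_next_zero_area(unsigned long *map, unsigned long size,
                                   unsigned long start, unsigned int nr,
                                   unsigned long align_mask)
        {
                /* offset 0 keeps the old "bit number itself is aligned" semantics */
                return bitmap_find_next_zero_area_off(map, size, start, nr,
                                                      align_mask, 0);
        }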
      Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kukjin Kim <kgene.kim@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e19b013
    • D
      mm/rmap: share the i_mmap_rwsem · 3dec0ba0
      Davidlohr Bueso authored
      Similarly to the anon memory counterpart, we can share the mapping's lock
      ownership as the interval tree is not modified when doing the walk,
      only the file page.
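      A sketch of the resulting walk (assuming the i_mmap_lock_read() /
      i_mmap_unlock_read() helpers introduced earlier in this series):
      
        i_mmap_lock_read(mapping);      /* down_read(&mapping->i_mmap_rwsem) */
        vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
                /* the walk only reads the interval tree; only the file page changes */
        }
        i_mmap_unlock_read(mapping);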
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dec0ba0
    • D
      mm: convert i_mmap_mutex to rwsem · c8c06efa
      Davidlohr Bueso authored
      The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
      similar data, one for file-backed pages and the other for anon memory.  To
      this end, this lock can also be an rwsem.  In addition, there are some
      important opportunities to share the lock when there are no tree
      modifications.
      
      This conversion is straightforward.  For now, all users take the write
      lock.
      
      [sfr@canb.auug.org.au: update fremap.c]
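      A minimal sketch of what the conversion looks like at a typical call site
      (all users still take the lock for writing at this point):
      
        /* before */
        mutex_lock(&mapping->i_mmap_mutex);
        /* ... touch mapping->i_mmap ... */
        mutex_unlock(&mapping->i_mmap_mutex);
      
        /* after: i_mmap_mutex becomes struct rw_semaphore i_mmap_rwsem */
        down_write(&mapping->i_mmap_rwsem);
        /* ... touch mapping->i_mmap ... */
        up_write(&mapping->i_mmap_rwsem);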
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8c06efa
    • D
      mm,fs: introduce helpers around the i_mmap_mutex · 8b28f621
      Davidlohr Bueso authored
      This series is a continuation of the conversion of the i_mmap_mutex to
      rwsem, following what we have for the anon memory counterpart, and it
      incorporates Hugh's feedback from the first iteration.
      
      Ultimately, the most obvious paths that require exclusive ownership of the
      lock are those that modify the VMA interval tree, via the
      vma_interval_tree_insert() and vma_interval_tree_remove() families.  Cases
      such as unmapping, where the PTE contents are changed but the tree remains
      untouched, should make it safe to share the i_mmap_rwsem.
      
      As such, the code of course is straightforward; however, the devil is very
      much in the details.  While it's been tested on a number of workloads
      without anything exploding, I would not be surprised if there are some
      less documented/known assumptions about the lock that could suffer from
      these changes.  Or maybe I'm just missing something, but either way I
      believe it's at the point where it could use more eyes and hopefully some
      time in linux-next.
      
      Because the lock type conversion is the heart of this patchset,
      it's worth noting a few comparisons between mutex and rwsem (xadd):
      
        (i) Same size, no extra footprint.
      
        (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
             exclusive lock ownership.
      
        (iii) Both can be slightly unfair wrt exclusive ownership, with
              writer lock stealing properties, not necessarily respecting
              FIFO order for granting the lock when contended.
      
        (iv) Mutexes can be slightly faster than rwsems when
             the lock is non-contended.
      
        (v) Both suck at performance for debug (slowpaths), which
            shouldn't matter anyway.
      
      Sharing the lock is obviously beneficial, and sem writer ownership is
      close enough to mutexes.  The biggest winner of these changes is
      migration.
      
      As for concrete numbers, the following performance results are for a
      4-socket 60-core IvyBridge-EX with 130Gb of RAM.
      
      Both the alltests and disk (xfs+ramdisk) workloads of the aim7 suite do
      quite well with this set, with a steady ~60% throughput (jpm) increase for
      alltests and up to ~30% for disk at high amounts of concurrency.  Lower
      counts of workload users (< 100) do not show much difference at all, so at
      least there are no regressions.
      
                          3.18-rc1            3.18-rc1-i_mmap_rwsem
      alltests-100     17918.72 (  0.00%)    28417.97 ( 58.59%)
      alltests-200     16529.39 (  0.00%)    26807.92 ( 62.18%)
      alltests-300     16591.17 (  0.00%)    26878.08 ( 62.00%)
      alltests-400     16490.37 (  0.00%)    26664.63 ( 61.70%)
      alltests-500     16593.17 (  0.00%)    26433.72 ( 59.30%)
      alltests-600     16508.56 (  0.00%)    26409.20 ( 59.97%)
      alltests-700     16508.19 (  0.00%)    26298.58 ( 59.31%)
      alltests-800     16437.58 (  0.00%)    26433.02 ( 60.81%)
      alltests-900     16418.35 (  0.00%)    26241.61 ( 59.83%)
      alltests-1000    16369.00 (  0.00%)    26195.76 ( 60.03%)
      alltests-1100    16330.11 (  0.00%)    26133.46 ( 60.03%)
      alltests-1200    16341.30 (  0.00%)    26084.03 ( 59.62%)
      alltests-1300    16304.75 (  0.00%)    26024.74 ( 59.61%)
      alltests-1400    16231.08 (  0.00%)    25952.35 ( 59.89%)
      alltests-1500    16168.06 (  0.00%)    25850.58 ( 59.89%)
      alltests-1600    16142.56 (  0.00%)    25767.42 ( 59.62%)
      alltests-1700    16118.91 (  0.00%)    25689.58 ( 59.38%)
      alltests-1800    16068.06 (  0.00%)    25599.71 ( 59.32%)
      alltests-1900    16046.94 (  0.00%)    25525.92 ( 59.07%)
      alltests-2000    16007.26 (  0.00%)    25513.07 ( 59.38%)
      
      disk-100          7582.14 (  0.00%)     7257.48 ( -4.28%)
      disk-200          6962.44 (  0.00%)     7109.15 (  2.11%)
      disk-300          6435.93 (  0.00%)     6904.75 (  7.28%)
      disk-400          6370.84 (  0.00%)     6861.26 (  7.70%)
      disk-500          6353.42 (  0.00%)     6846.71 (  7.76%)
      disk-600          6368.82 (  0.00%)     6806.75 (  6.88%)
      disk-700          6331.37 (  0.00%)     6796.01 (  7.34%)
      disk-800          6324.22 (  0.00%)     6788.00 (  7.33%)
      disk-900          6253.52 (  0.00%)     6750.43 (  7.95%)
      disk-1000         6242.53 (  0.00%)     6855.11 (  9.81%)
      disk-1100         6234.75 (  0.00%)     6858.47 ( 10.00%)
      disk-1200         6312.76 (  0.00%)     6845.13 (  8.43%)
      disk-1300         6309.95 (  0.00%)     6834.51 (  8.31%)
      disk-1400         6171.76 (  0.00%)     6787.09 (  9.97%)
      disk-1500         6139.81 (  0.00%)     6761.09 ( 10.12%)
      disk-1600         4807.12 (  0.00%)     6725.33 ( 39.90%)
      disk-1700         4669.50 (  0.00%)     5985.38 ( 28.18%)
      disk-1800         4663.51 (  0.00%)     5972.99 ( 28.08%)
      disk-1900         4674.31 (  0.00%)     5949.94 ( 27.29%)
      disk-2000         4668.36 (  0.00%)     5834.93 ( 24.99%)
      
      In addition, there is a 67.5% increase in successfully migrated NUMA
      pages, thus improving node locality.
      
      The patch layout is simple but designed for bisection (in case reversion
      is needed if the changes break upstream) and easier review:
      
      o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
      o Patches 5-10 share the lock in specific paths, each patch
        details the rationale behind why it should be safe.
      
      This patchset has been tested with: postgres 9.4 (with brand new hugetlb
      support), hugetlbfs test suite (all tests pass, in fact more tests pass
      with these changes than with an upstream kernel), ltp, aim7 benchmarks,
      memcached and iozone with the -B option for mmap'ing.  *Untested* paths
      are nommu, memory-failure, uprobes and xip.
      
      This patch (of 8):
      
      Various parts of the kernel acquire and release this mutex, so add
      i_mmap_lock_write() and i_mmap_unlock_write() helper functions that will
      encapsulate this logic.  The next patch will make use of these.
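      A sketch of the helpers at this stage of the series (they still wrap the
      mutex; the rwsem conversion comes in a later patch):
      
        static inline void i_mmap_lock_write(struct address_space *mapping)
        {
                mutex_lock(&mapping->i_mmap_mutex);
        }
      
        static inline void i_mmap_unlock_write(struct address_space *mapping)
        {
                mutex_unlock(&mapping->i_mmap_mutex);
        }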
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b28f621
  2. 12 December 2014 (5 commits)
    • T
      pstore-ram: Allow optional mapping with pgprot_noncached · 027bc8b0
      Tony Lindgren authored
      On some ARMs the memory can be mapped pgprot_noncached() and still
      work for atomic operations. As pointed out by Colin Cross
      <ccross@android.com>, in some cases you do want to use
      pgprot_noncached() if the SoC supports it, so that a debug printk
      is visible right before a write hangs the system.
      
      On ARMs, the atomic operations on strongly ordered memory are
      implementation defined. So let's provide an optional kernel parameter
      for configuring pgprot_noncached(), and use pgprot_writecombine() by
      default.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rob Herring <robherring2@gmail.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: stable@vger.kernel.org
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      027bc8b0
    • M
      net/mlx4: Add support for A0 steering · 7d077cd3
      Matan Barak authored
      Add the required firmware commands for A0 steering and a way to enable
      that. The firmware support focuses on INIT_HCA, QUERY_HCA, QUERY_PORT,
      QUERY_DEV_CAP and QUERY_FUNC_CAP commands. Those commands are used
      to configure and query the device.
      
      The different A0 DMFS (steering) modes are:
      
      Static - optimized performance, but flow steering rules are
      limited. This mode should be chosen explicitly by the user
      in order to be used.
      
      Dynamic - this mode should be explicitly chosen by the user.
      In this mode, the FW works in optimized steering mode as long as
      it can and afterwards automatically drops to classic (full) DMFS.
      
      Disable - this mode should be explicitly chosen by the user.
      The user instructs the system not to use optimized steering, even if
      the FW supports Dynamic A0 DMFS (and thus will be able to use optimized
      steering in Default A0 DMFS mode).
      
      Default - this mode is chosen implicitly. In this mode, if the FW
      supports Dynamic A0 DMFS, it'll work in that mode. Otherwise, it'll
      work in Disable A0 DMFS mode.
      
      Under an SRIOV configuration, when the A0 steering mode is enabled,
      older guest VF drivers that aren't using the RX QP allocation flag
      (MLX4_RESERVE_A0_QP) will get a QP from the general range and
      fail when attempting to register a steering rule. To avoid that,
      once static A0 mode is enabled, the PF context behaviour is changed
      to require support for the allocation flag in VF drivers too.
      
      In order to enable A0 steering, we use log_num_mgm_entry_size param.
      If the value of the parameter is not positive, we treat the absolute
      value of log_num_mgm_entry_size as a bit field. Setting bit 2 of this
      bit field enables static A0 steering.
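      The decoding described above amounts to roughly the following (the helper
      name is illustrative, not the driver's actual code):
      
        /* illustrative decode of the module parameter semantics described above */
        static bool a0_static_steering_requested(int log_num_mgm_entry_size)
        {
                if (log_num_mgm_entry_size > 0)
                        return false;           /* positive: plain log2 entry size */
                /* non-positive: |value| is a bit field, bit 2 = static A0 steering */
                return (-log_num_mgm_entry_size) & (1 << 2);
        }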
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d077cd3
    • M
      net/mlx4: Add A0 hybrid steering · d57febe1
      Matan Barak authored
      A0 hybrid steering is a form of high performance flow steering.
      By using this mode, mlx4 cards use a fast limited table based steering,
      in order to enable fast steering of unicast packets to a QP.
      
      In order to implement A0 hybrid steering we allocate resources
      from different zones:
      (1) General range
      (2) Special MAC-assigned QPs [RSS, Raw-Ethernet] each has its own region.
      
      When we create an RSS QP or a raw Ethernet (A0 steerable and BF ready) QP,
      we try hard to allocate the QP from range (2). Otherwise, we try hard not
      to allocate from this range. However, when the system is pushed to its
      limits and one needs every resource, the allocator uses every region it can.
      
      Meaning, when we run out of raw-eth qps, the allocator allocates from the
      general range (and the special-A0 area is no longer active). If we run out
      of RSS qps, the mechanism tries to allocate from the raw-eth QP zone. If that
      is also exhausted, the allocator will allocate from the general range
      (and the A0 region is no longer active).
      
      Note that if a raw-eth qp is allocated from the general range, it attempts
      to allocate the range such that bits 6 and 7 (blueflame bits) in the
      QP number are not set.
      
      When the feature is used in SRIOV, the VF has to notify the PF what
      kind of QP attributes it needs. In order to do that, along with the
      "Eth QP blueflame" bit, we reserve a new "A0 steerable QP". According
      to the combination of these bits, the PF tries to allocate a suitable QP.
      
      In order to maintain backward compatibility (with older PFs), the PF
      notifies which QP attributes it supports via QUERY_FUNC_CAP command.
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d57febe1
    • E
      net/mlx4: Change QP allocation scheme · ddae0349
      Eugenia Emantayev authored
      When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields
      in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset.
      
      The current Ethernet driver code reserves a Tx QP range with 256b alignment.
      
      This is wrong because if there are more than 64 Tx QPs in use,
      QPNs >= base + 65 will have bits 6/7 set.
      
      This problem is not specific for the Ethernet driver, any entity that
      tries to reserve more than 64 BF-enabled QPs should fail. Also, using
      ranges is not necessary here and is wasteful.
      
      The new mechanism introduced here will support reservation for
      "Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
      (when hypervisors support WC in VMs). The flow we use is:
      
      1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
         and request "BF enabled QPs" if BF is supported for the function
      
      2. In the ALLOC_RES FW command, change param1 to:
      a. param1[23:0]  - number of QPs
      b. param1[31-24] - flags controlling QPs reservation
      
      Bit 31 refers to Eth blueflame supported QPs. Those QPs must have
      bits 6 and 7 unset in order to be used in Ethernet.
      
      Bits 24-30 of the flags are currently reserved.
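      Summarizing the layout above (macro names here are illustrative, not the
      driver's):
      
        #define ALLOC_RES_QP_COUNT_MASK  0x00ffffffu  /* param1[23:0]:  number of QPs */
        #define ALLOC_RES_FLAGS_SHIFT    24           /* param1[31:24]: reservation flags */
        #define ALLOC_RES_ETH_BF_QP      (1u << 31)   /* Eth blueflame QPs (bits 6/7 clear) */
        /* bits 24-30 of the flags are reserved */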
      
      When a function tries to allocate a QP, it states the required attributes
      for this QP. Those attributes are considered "best-effort". If an attribute,
      such as Ethernet BF enabled QP, is a must-have attribute, the function has
      to check that attribute is supported before trying to do the allocation.
      
      In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
      which are unsupported. If SRIOV is used, the PF validates those attributes
      and masks out unsupported attributes as well. In order to notify VFs which
      attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's
      mailbox is filled by the PF, which notifies which QP allocation attributes
      it supports.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ddae0349
    • M
      net/mlx4_core: Use tasklet for user-space CQ completion events · 3dca0f42
      Matan Barak authored
      Previously, we've fired all our completion callbacks straight from our ISR.
      
      Some of those callbacks were lightweight (for example, mlx4_en's and
      IPoIB napi callbacks), but some of them did more work (for example,
      the user-space RDMA stack uverbs' completion handler). Besides that,
      doing more than the minimal work in an ISR is generally considered wrong;
      it could even lead to a hard lockup of the system. When a lot
      of completion events are generated by the hardware, the loop over those
      events can become so long that the system watchdog detects a hard
      lockup.
      
      In order to avoid that, add a new way of invoking completion events
      callbacks. In the interrupt itself, we add the CQs which receive completion
      event to a per-EQ list and schedule a tasklet. In the tasklet context
      we loop over all the CQs in the list and invoke the user callback.
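      In outline (field and function names are illustrative, not verbatim driver
      code):
      
        /* EQ interrupt handler: record the CQ and defer the heavy work */
        list_add_tail(&cq->tasklet_ctx.list, &eq->tasklet_ctx.list);
        tasklet_schedule(&eq->tasklet_ctx.task);
      
        /* tasklet callback: invoke the user completion handlers outside the ISR */
        list_for_each_entry_safe(cq, tmp, &eq->tasklet_ctx.list, tasklet_ctx.list) {
                list_del_init(&cq->tasklet_ctx.list);
                cq->tasklet_ctx.comp(cq);
        }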
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3dca0f42
  3. 11 December 2014 (31 commits)
    • G
      net: introduce helper macro for_each_cmsghdr · f95b414e
      Gu Zheng authored
      Introduce the helper macro for_each_cmsghdr as a wrapper for enumerating
      the cmsghdr entries of a msghdr; this is just a cleanup.
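      The macro is essentially a loop over CMSG_FIRSTHDR()/CMSG_NXTHDR();
      roughly:
      
        #define for_each_cmsghdr(cmsg, msg)         \
                for (cmsg = CMSG_FIRSTHDR(msg);     \
                     cmsg;                          \
                     cmsg = CMSG_NXTHDR(msg, cmsg))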
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f95b414e
    • J
      printk: add and use LOGLEVEL_<level> defines for KERN_<LEVEL> equivalents · a39d4a85
      Joe Perches authored
      Use #defines instead of magic values.
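      The resulting constants mirror the KERN_<LEVEL> strings; a rough sketch
      (exact values and comments per include/linux/kern_levels.h may differ
      slightly):
      
        #define LOGLEVEL_SCHED    -2  /* deferred printk from scheduler context */
        #define LOGLEVEL_DEFAULT  -1  /* default (or last) loglevel */
        #define LOGLEVEL_EMERG     0  /* system is unusable */
        #define LOGLEVEL_ALERT     1  /* action must be taken immediately */
        #define LOGLEVEL_CRIT      2  /* critical conditions */
        #define LOGLEVEL_ERR       3  /* error conditions */
        #define LOGLEVEL_WARNING   4  /* warning conditions */
        #define LOGLEVEL_NOTICE    5  /* normal but significant condition */
        #define LOGLEVEL_INFO      6  /* informational */
        #define LOGLEVEL_DEBUG     7  /* debug-level messages */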
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jason Baron <jbaron@akamai.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a39d4a85
    • J
      printk: remove used-once early_vprintk · 1dc6244b
      Joe Perches authored
      Eliminate the unlikely possibility of message interleaving for
      early_printk/early_vprintk use.
      
      early_vprintk can be done via the %pV extension so remove this
      unnecessary function and change early_printk to have the equivalent
      vprintk code.
      
      All uses of early_printk already end with a newline so also remove the
      unnecessary newline from the early_printk function.
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1dc6244b
    • P
      kernel: add panic_on_warn · 9e3961a0
      Prarit Bhargava authored
      There have been several times where I have had to rebuild a kernel to
      cause a panic when hitting a WARN() in the code in order to get a crash
      dump from a system.  Sometimes this is easy to do, other times (such as
      in the case of a remote admin) it is not trivial to send new images to
      the user.
      
      A much easier method would be a switch to change the WARN() over to a
      panic.  This makes debugging easier in that I can now test the actual
      image the WARN() was seen on and I do not have to engage in remote
      debugging.
      
      This patch adds a panic_on_warn kernel parameter and a
      /proc/sys/kernel/panic_on_warn sysctl; when set, panic() is called in the
      warn_slowpath_common() path.  The function will still print out the
      location of the warning.
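      The core of the change is a short check in the warn path; roughly:
      
        if (panic_on_warn) {
                /*
                 * Clear the flag first so that a nested WARN() hit while
                 * panicking does not recurse back into panic() on this thread.
                 */
                panic_on_warn = 0;
                panic("panic_on_warn set ...\n");
        }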
      
      An example of the panic_on_warn output:
      
      The first line below is from the WARN_ON() to output the WARN_ON()'s
      location.  After that the panic() output is displayed.
      
          WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
          Kernel panic - not syncing: panic_on_warn set ...
      
          CPU: 30 PID: 11698 Comm: insmod Tainted: G        W  OE  3.17.0+ #57
          Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
           0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
           0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
           ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
          Call Trace:
           [<ffffffff81665190>] dump_stack+0x46/0x58
           [<ffffffff8165e2ec>] panic+0xd0/0x204
           [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
           [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
           [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
           [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
           [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
           [<ffffffff81002144>] do_one_initcall+0xd4/0x210
           [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
           [<ffffffff810f8889>] load_module+0x16a9/0x1b30
           [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
           [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
           [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
           [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17
      
      Successfully tested by me.
      
      hpa said: There is another very valid use for this: many operators would
      rather a machine shut down than be potentially compromised either
      functionally or security-wise.
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e3961a0
    • Y
      include/linux/file.h: remove get_unused_fd() macro · f938612d
      Yann Droneaud authored
      Macro get_unused_fd() is used to allocate a file descriptor with default
      flags.  Those default flags (0) don't enable close-on-exec.
      
      This can be seen as an unsafe default: in most cases close-on-exec should
      be enabled so as not to leak file descriptors across exec().
      
      It would be better to have a "safer" default set of flags; currently
      O_CLOEXEC must be passed explicitly to enable close-on-exec.
      
      Instead, this patch removes get_unused_fd() so that out-of-tree modules
      won't be affected by a runtime behaviour change which might introduce
      other kinds of bugs: it's better to catch the change at build time,
      making it easier to fix.
      
      Removing the macro will also promote use of get_unused_fd_flags() (or
      anon_inode_getfd()) with flags provided by userspace.  Or, if flags cannot
      be given by userspace, with flags set to O_CLOEXEC by default.
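      For callers, the conversion is a one-line change; for example:
      
        fd = get_unused_fd_flags(0);          /* literal replacement of get_unused_fd() */
        fd = get_unused_fd_flags(O_CLOEXEC);  /* preferred: don't leak the fd across exec() */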
      Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f938612d
    • O
      exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent() · 7c8bd232
      Oleg Nesterov authored
      Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
      we can simply pass "dead_children" list to exit_ptrace() and remove
      another release_task() loop.  Plus this way we do not need to drop and
      reacquire tasklist_lock.
      
      Also shift the list_empty(ptraced) check; if we want this optimization, it
      makes sense to eliminate the function call altogether.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
      Cc: Sterling Alexander <stalexan@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c8bd232
    • J
      mm: move page->mem_cgroup bad page handling into generic code · 9edad6ea
      Johannes Weiner authored
      Now that the external page_cgroup data structure and its lookup is
      gone, let the generic bad_page() check for page->mem_cgroup sanity.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9edad6ea
    • J
      mm: page_cgroup: rename file to mm/swap_cgroup.c · 5d1ea48b
      Johannes Weiner authored
      Now that the external page_cgroup data structure and its lookup is gone,
      the only code remaining in there is swap slot accounting.
      
      Rename it and move the conditional compilation into mm/Makefile.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d1ea48b
    • J
      mm: embed the memcg pointer directly into struct page · 1306a85a
      Johannes Weiner authored
      Memory cgroups used to have 5 per-page pointers.  To allow users to
      disable that amount of overhead during runtime, those pointers were
      allocated in a separate array, with a translation layer between them and
      struct page.
      
      There is now only one page pointer remaining: the memcg pointer, that
      indicates which cgroup the page is associated with when charged.  The
      complexity of runtime allocation and the runtime translation overhead is
      no longer justified to save that *potential* 0.19% of memory.  With
      CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
      after the page->private member and doesn't even increase struct page,
      and then this patch actually saves space.  Remaining users that care can
      still compile their kernels without CONFIG_MEMCG.
      
           text    data     bss     dec     hex     filename
        8828345 1725264  983040 11536649 b00909  vmlinux.old
        8827425 1725264  966656 11519345 afc571  vmlinux.new
      
      [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1306a85a
    • M
      mm, memcg: fix potential undefined behaviour in page stat accounting · e4bd6a02
      Michal Hocko authored
      Since commit d7365e78 ("mm: memcontrol: fix missed end-writeback
      page accounting") mem_cgroup_end_page_stat consumes locked and flags
      variables directly rather than via pointers which might trigger C
      undefined behavior as those variables are initialized only in the slow
      path of mem_cgroup_begin_page_stat.
      
      Although mem_cgroup_end_page_stat handles the parameters correctly and
      touches them only when they hold a sensible value, it is the caller which
      loads a potentially uninitialized value, which might then allow the
      compiler to do crazy things.
      
      I haven't seen any warning from gcc, and it seems that the current
      version (4.9) doesn't exploit this type of undefined behavior, but Sasha
      has reported the following:
      
        UBSan: Undefined behaviour in mm/rmap.c:1084:2
        load of value 255 is not a valid value for type '_Bool'
        CPU: 4 PID: 8304 Comm: rngd Not tainted 3.18.0-rc2-next-20141029-sasha-00039-g77ed13d-dirty #1427
        Call Trace:
          dump_stack (lib/dump_stack.c:52)
          ubsan_epilogue (lib/ubsan.c:159)
          __ubsan_handle_load_invalid_value (lib/ubsan.c:482)
          page_remove_rmap (mm/rmap.c:1084 mm/rmap.c:1096)
          unmap_page_range (./arch/x86/include/asm/atomic.h:27 include/linux/mm.h:463 mm/memory.c:1146 mm/memory.c:1258 mm/memory.c:1279 mm/memory.c:1303)
          unmap_single_vma (mm/memory.c:1348)
          unmap_vmas (mm/memory.c:1377 (discriminator 3))
          exit_mmap (mm/mmap.c:2837)
          mmput (kernel/fork.c:659)
          do_exit (./arch/x86/include/asm/thread_info.h:168 kernel/exit.c:462 kernel/exit.c:747)
          do_group_exit (include/linux/sched.h:775 kernel/exit.c:873)
          SyS_exit_group (kernel/exit.c:901)
          tracesys_phase2 (arch/x86/kernel/entry_64.S:529)
      
      Fix this by using pointer parameters for both locked and flags and be
      more robust for future compiler changes even though the current code is
      implemented correctly.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4bd6a02
    • J
      mm: memcontrol: drop bogus RCU locking from mem_cgroup_same_or_subtree() · 2314b42d
      Johannes Weiner authored
      None of the mem_cgroup_same_or_subtree() callers actually require it to
      take the RCU lock, either because they hold it themselves or they have css
      references.  Remove it.
      
      To make the API change clear, rename the leftover helper to
      mem_cgroup_is_descendant() to match cgroup_is_descendant().
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2314b42d
    • J
      mm: memcontrol: pull the NULL check from __mem_cgroup_same_or_subtree() · 413918bb
      Johannes Weiner authored
      The NULL in mm_match_cgroup() comes from a possibly exiting mm->owner.  It
      makes a lot more sense to check where it's looked up, rather than check
      for it in __mem_cgroup_same_or_subtree() where it's unexpected.
      
      No other callsite passes NULL to __mem_cgroup_same_or_subtree().
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      413918bb
    • V
      memcg: use generic slab iterators for showing slabinfo · b047501c
      Vladimir Davydov authored
      Let's use generic slab_start/next/stop for showing memcg caches info.  In
      contrast to the current implementation, this will work even if all memcg
      caches' info doesn't fit into a seq buffer (a page), plus it simply looks
      neater.
      
      Actually, the main reason I do this isn't mere cleanup.  I'm going to zap
      the memcg_slab_caches list, because I find it useless provided we have the
      slab_caches list, and this patch is a step in this direction.
      
      It should be noted that before this patch an attempt to read
      memory.kmem.slabinfo of a cgroup that doesn't have kmem limit set resulted
      in -EIO, while after this patch it will silently show nothing except the
      header, but I don't think it will frustrate anyone.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b047501c
    • S
      mm, hugetlb: correct bit shift in hstate_sizelog() · 97ad2be1
      Sasha Levin authored
      hstate_sizelog() would shift left an int rather than long, triggering
      undefined behaviour and passing an incorrect value when the requested
      page size was more than 4GB, thus breaking >4GB pages.
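      The fix is simply to shift an unsigned long instead of an int; roughly:
      
        /* before: 1 is an int, so the shift overflows for page sizes of 4GB and up */
        return size_to_hstate(1 << page_size_log);
      
        /* after */
        return size_to_hstate(1UL << page_size_log);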
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97ad2be1
    • J
      mm: memcontrol: remove unnecessary PCG_USED pc->mem_cgroup valid flag · 29833315
      Johannes Weiner authored
      pc->mem_cgroup had to be left intact after uncharge for the final LRU
      removal, and !PCG_USED indicated whether the page was uncharged.  But
      since commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API") pages
      are uncharged after the final LRU removal.  Uncharge can simply clear
      the pointer and the PCG_USED/PageCgroupUsed sites can test that instead.
      
      Because this is the last page_cgroup flag, this patch reduces the memcg
      per-page overhead to a single pointer.
      
      [akpm@linux-foundation.org: remove unneeded initialization of `memcg', per Michal]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29833315
    • J
      mm: memcontrol: remove unnecessary PCG_MEM memory charge flag · f4aaa8b4
      Johannes Weiner authored
      PCG_MEM is a remnant from an earlier version of 0a31bc97 ("mm:
      memcontrol: rewrite uncharge API"), used to tell whether migration cleared
      a charge while leaving pc->mem_cgroup valid and PCG_USED set.  But in the
      final version, mem_cgroup_migrate() directly uncharges the source page,
      rendering this distinction unnecessary.  Remove it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4aaa8b4
    • J
      mm: memcontrol: remove unnecessary PCG_MEMSW memory+swap charge flag · 18eca2e6
      Johannes Weiner authored
      Now that mem_cgroup_swapout() fully uncharges the page, every page that is
      still in use when reaching mem_cgroup_uncharge() is known to carry both
      the memory and the memory+swap charge.  Simplify the uncharge path and
      remove the PCG_MEMSW page flag accordingly.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18eca2e6
    • V
      mm, compaction: simplify deferred compaction · 97d47a65
      Vlastimil Babka authored
      Since commit 53853e2d ("mm, compaction: defer each zone individually
      instead of preferred zone"), compaction is deferred for each zone where
      sync direct compaction fails, and reset where it succeeds.  However, it
      was observed that for the DMA zone compaction often appeared to succeed
      while the subsequent allocation attempt would not, due to a different
      outcome of the watermark check.
      
      In order to properly defer compaction in this zone, the candidate zone
      has to be passed back to __alloc_pages_direct_compact() and compaction
      deferred in the zone after the allocation attempt fails.
      
      The large source of mismatch between watermark check in compaction and
      allocation was the lack of alloc_flags and classzone_idx values in
      compaction, which has been fixed in the previous patch.  So with this
      problem fixed, we can simplify the code by removing the candidate_zone
      parameter and deferring in __alloc_pages_direct_compact().
      
      After this patch, the compaction activity during stress-highalloc
      benchmark is still somewhat increased, but it's negligible compared to the
      increase that occurred without the better watermark checking.  This
      suggests that it is still possible to apparently succeed in compaction but
      fail to allocate, possibly due to parallel allocation activity.
      
      [akpm@linux-foundation.org: fix build]
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97d47a65
    • V
      mm, compaction: pass classzone_idx and alloc_flags to watermark checking · ebff3980
      Vlastimil Babka authored
      Compaction relies on zone watermark checks for decisions such as if it's
      worth to start compacting in compaction_suitable() or whether compaction
      should stop in compact_finished().  The watermark checks take
      classzone_idx and alloc_flags parameters, which are related to the memory
      allocation request.  But from the context of compaction they are currently
      passed as 0, including the direct compaction which is invoked to satisfy
      the allocation request, and could therefore know the proper values.
      
      The lack of proper values can lead to mismatch between decisions taken
      during compaction and decisions related to the allocation request.  Lack
      of proper classzone_idx value means that lowmem_reserve is not taken into
      account.  This has manifested (during recent changes to deferred
      compaction) when DMA zone was used as fallback for preferred Normal zone.
      compaction_suitable() without proper classzone_idx would think that the
      watermarks are already satisfied, but watermark check in
      get_page_from_freelist() would fail.  Because of this problem, deferring
      compaction has extra complexity that can be removed in the following
      patch.
      
      The issue (not confirmed in practice) with missing alloc_flags is opposite
      in nature.  For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
      ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
      CMA-enabled systems) the watermark checking in compaction with 0 passed
      will be stricter than in get_page_from_freelist().  In these cases
      compaction might be running for a longer time than is really needed.
      
      Another issue with compaction_suitable() is that the check for "does the
      zone need compaction at all?" comes only after the check "does the zone
      have enough free pages to succeed compaction".  The latter considers extra
      pages for migration and can therefore in some situations fail and return
      COMPACT_SKIPPED, although the high-order allocation would succeed and we
      should return COMPACT_PARTIAL.
      
      This patch fixes these problems by adding alloc_flags and classzone_idx to
      struct compact_control and related functions involved in direct compaction
      and watermark checking.  Where possible, all other callers of
      compaction_suitable() pass proper values where those are known.  This is
      currently limited to classzone_idx, which is sometimes known in kswapd
      context.  However, the direct reclaim callers should_continue_reclaim()
      and compaction_ready() do not currently know the proper values, so the
      coordination between reclaim and compaction may still not be as accurate
      as it could be.  This can be fixed later, if it's shown to be an issue.
      
      Additionally, the checks in compaction_suitable() are reordered to address
      the second issue described above.
      
      The effect of this patch should be slightly better high-order allocation
      success rates and/or less compaction overhead, depending on the type of
      allocations and presence of CMA.  It allows simplifying deferred
      compaction code in a followup patch.
      
      When testing with stress-highalloc, there was some slight improvement
      (which might be just due to variance) in success rates of non-THP-like
      allocations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebff3980
    • V
      mm: introduce single zone pcplists drain · 93481ff0
      Vlastimil Babka authored
      The functions for draining per-cpu pages back to buddy allocators
      currently always operate on all zones.  There are however several cases
      where the drain is only needed in the context of a single zone, and
      spilling other pcplists is a waste of time both due to the extra
      spilling and later refilling.
      
      This patch introduces a new zone pointer parameter to drain_all_pages()
      and changes the dummy parameter of drain_local_pages() to also be a zone
      pointer.  When NULL is passed, the functions operate on all zones as
      usual.  Passing a specific zone pointer reduces the work to that single
      zone.
      
      All callers are updated to pass the NULL pointer in this patch.
      Conversion to single zone (where appropriate) is done in further
      patches.
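      After this patch the interfaces look roughly like:
      
        void drain_all_pages(struct zone *zone);    /* NULL: drain pcplists of all zones */
        void drain_local_pages(struct zone *zone);  /* NULL: all zones, this CPU only */
      
        drain_all_pages(NULL);   /* existing callers: unchanged behaviour */
        drain_all_pages(zone);   /* later patches: only the zone of interest */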
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93481ff0
    • J
      mm: memcontrol: remove obsolete kmemcg pinning tricks · 64f21993
      Johannes Weiner authored
      As charges now pin the css explicitly, there is no more need for kmemcg
      to acquire a proxy reference for outstanding pages during offlining, or
      maintain state to identify such "dead" groups.
      
      This was the last user of the uncharge functions' return values, so remove
      them as well.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64f21993
    • J
      mm: memcontrol: take a css reference for each charged page · e8ea14cc
      Johannes Weiner authored
      Charges currently pin the css indirectly by playing tricks during
      css_offline(): user pages stall the offlining process until all of them
      have been reparented, whereas kmemcg acquires a keep-alive reference if
      outstanding kernel pages are detected at that point.
      
      In preparation for removing all this complexity, make the pinning explicit
      and acquire a css reference for every charged page.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8ea14cc
    • J
      kernel: res_counter: remove the unused API · 5b1efc02
      Johannes Weiner authored
      All memory accounting and limiting has been switched over to the
      lockless page counters.  Bye, res_counter!
      
      [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
      [mhocko@suse.cz: ditch the last remainings of res_counter]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b1efc02
    • J
      mm: hugetlb_cgroup: convert to lockless page counters · 71f87bee
      Johannes Weiner authored
      Abandon the spinlock-protected byte counters in favor of the unlocked
      page counters in the hugetlb controller as well.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71f87bee
    • J
      mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner authored
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticeable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144 threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_memparse(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitly requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim up to 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
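      As a quick reference, the old-to-new mapping sketched above (signatures
      paraphrased, not exact):
      
        page_counter_charge(counter, nr_pages);             /* was res_counter_charge_nofail() */
        page_counter_try_charge(counter, nr_pages, &fail);  /* was res_counter_charge() */
        page_counter_uncharge(counter, nr_pages);
        page_counter_cancel(counter, nr_pages);             /* was res_counter_uncharge_until() */
        page_counter_limit(counter, limit);                 /* was res_counter_set_limit() */
        page_counter_memparse(buf, &nr_pages);              /* was res_counter_memparse_write_strategy() */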
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e32cb2e
    • J
      irda: Convert function pointer arrays and uses to const · 785c20a0
      Joe Perches authored
      Making things const is a good thing.
      
      (x86-64 defconfig with all irda)
      $ size net/irda/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       109276	   1868	    244	 111388	  1b31c	net/irda/built-in.o.new
       108828	   2316	    244	 111388	  1b31c	net/irda/built-in.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      785c20a0
    • J
      llc: Make llc_sap_action_t function pointer arrays const · 22bbf5f3
      Joe Perches authored
      It's better when function pointer arrays aren't modifiable.
      
      Net change:
      
      $ size net/llc/built-in.o.*
         text	   data	    bss	    dec	    hex	filename
        61193	  12758	   1344	  75295	  1261f	net/llc/built-in.o.new
        47113	  27030	   1344	  75487	  126df	net/llc/built-in.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      22bbf5f3
    • J
      llc: Make llc_conn_ev_qfyr_t function pointer arrays const · 9b373069
      Joe Perches authored
      It's better when function pointer arrays aren't modifiable.
      
      Net change from original:
      
      $ size net/llc/built-in.o.*
         text	   data	    bss	    dec	    hex	filename
        61065	  12886	   1344	  75295	  1261f	net/llc/built-in.o.new
        47113	  27030	   1344	  75487	  126df	net/llc/built-in.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b373069
    • J
      llc: Make function pointer arrays const · 14b7d95f
      Joe Perches authored
      It's better when function pointer arrays aren't modifiable.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14b7d95f
    • D
      net, lib: kill arch_fast_hash library bits · 0cb6c969
      Daniel Borkmann authored
      As there are now no remaining users of arch_fast_hash(), lets kill
      it entirely.
      
      This basically reverts commit 71ae8aac ("lib: introduce arch
      optimized hash library") and follow-up work, e.g., commit
      23721754 ("lib: hash: follow-up fixups for arch hash"),
      commit e3fec2f7 ("lib: Add missing arch generic-y entries for
      asm-generic/hash.h") and last but not least commit 6a02652d
      ("perf tools: Fix include for non x86 architectures").
      
      Cc: Francesco Fusco <fusco@ntop.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0cb6c969
    • A
      net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb · fd11a83d
      Alexander Duyck authored
      This change pulls the core functionality out of __netdev_alloc_skb and
      places it in a new function named __alloc_rx_skb.  The reason for doing
      this is to make these bits accessible to a new function, __napi_alloc_skb.
      In addition __alloc_rx_skb now has a new flags value that is used to
      determine which page frag pool to allocate from.  If the SKB_ALLOC_NAPI
      flag is set then the NAPI pool is used.  The advantage of this is that we
      do not have to use local_irq_save/restore when accessing the NAPI pool from
      NAPI context.
      
      In my test setup I saw at least 11ns of savings using the napi_alloc_skb
      function versus the netdev_alloc_skb function, most of this being due to
      the fact that we didn't have to call local_irq_save/restore.
      
      The main use case for napi_alloc_skb would be for things such as copybreak
      or page fragment based receive paths where an skb is allocated after the
      data has been received instead of before.
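      Typical use in a driver's NAPI poll path (a sketch; priv->napi and pkt_len
      are placeholders for the driver's own NAPI context and packet length):
      
        /* in the NAPI poll handler: no local_irq_save/restore needed for the frag pool */
        skb = napi_alloc_skb(&priv->napi, pkt_len);
        if (unlikely(!skb))
                break;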
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fd11a83d