1. 15 Sep, 2017 1 commit
  2. 14 Sep, 2017 2 commits
  3. 11 Sep, 2017 1 commit
  4. 10 Sep, 2017 5 commits
  5. 09 Sep, 2017 12 commits
    • watchdog: Revert "iTCO_wdt: all versions count down twice" · fc61e83a
      Wim Van Sebroeck committed
      This reverts commit 1fccb730.
      Reported as Bug 196509 - iTCO_wdt regression reboot before timeout expire
      Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
    • dt-binding: net: sfp binding documentation · 3ef37140
      Baruch Siach committed
      Add device-tree binding documentation for SFP transceivers. Support for
      SFP transceivers has been recently introduced (drivers/net/phy/sfp.c).
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: add SFF vendor prefix · 165da358
      Baruch Siach committed
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: don't confuse with generic PHY property · c43593d8
      Baruch Siach committed
      This complements commit 9a94b3a4 (dt-binding: phy: don't confuse with
      Ethernet phy properties).
      
      The generic PHY 'phys' property sometimes appears in the same node with
      the Ethernet PHY 'phy' or 'phy-handle' properties. Add a warning in
      ethernet.txt to reduce confusion.
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/pps: aesthetic tweaks to PPS-related content · a2d81803
      Robert P. J. Day committed
      Collection of aesthetic adjustments to various PPS-related files,
      directories and Documentation, some quite minor just for the sake of
      consistency, including:
      
       * Updated example of pps device tree node (courtesy Rodolfo G.)
       * "PPS-API" -> "PPS API"
       * "pps_source_info_s" -> "pps_source_info"
       * "ktimer driver" -> "pps-ktimer driver"
       * "ppstest /dev/pps0" -> "ppstest /dev/pps1" to match example
       * Add missing PPS-related entries to MAINTAINERS file
       * Other trivialities
      
      Link: http://lkml.kernel.org/r/alpine.LFD.2.20.1708261048220.8106@localhost.localdomain
      Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
      Acked-by: Rodolfo Giometti <giometti@enneenne.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rbtree: cache leftmost node internally · cd9e61ed
      Davidlohr Bueso committed
      Patch series "rbtree: Cache leftmost node internally", v4.
      
      A series extending rbtrees to internally cache the leftmost node such
      that we can have a fast overlap check optimization for all interval
      tree users[1].  The benefits of this series are that it will:
      
      (i)   Unify users that do internal leftmost node caching.
      (ii)  Optimize all interval tree users.
      (iii) Convert at least two new users (epoll and procfs) to the new interface.
      
      This patch (of 16):
      
      Red-black tree semantics imply that nodes with smaller or greater (or
      equal for duplicates) keys are always to the left and right,
      respectively.  For the kernel this is extremely evident when considering
      our rb_first() semantics.  Enabling lookups for the smallest node in the
      tree in O(1) can save a good chunk of cycles in not having to walk down
      the tree each time.  To this end there are a few core users that
      explicitly do this, such as the scheduler and rtmutexes.  There is also
      the desire for interval trees to have this optimization allowing faster
      overlap checking.
      
      This patch introduces a new 'struct rb_root_cached' which is just the
      root with a cached pointer to the leftmost node.  The reason why the
      regular rb_root was not extended instead of adding a new structure was
      that this allows the user to have the choice between memory footprint
      and actual tree performance.  The new wrappers on top of the regular
      rb_root calls are:
      
       - rb_first_cached(cached_root) -- which is a fast replacement
           for rb_first.
      
       - rb_insert_color_cached(node, cached_root, new)
      
       - rb_erase_cached(node, cached_root)
      
      In addition, augmented cached interfaces are also added for basic
      insertion and deletion operations, which becomes important for the
      interval tree changes.
      
      With the exception of the inserts, which add a bool for updating the
      new leftmost, the interfaces are kept the same.  To this end, porting rb
      users to the cached version becomes really trivial, and keeping current
      rbtree semantics for users that don't care about the optimization
      requires zero overhead.
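
The caching idea described above can be sketched outside the kernel. The Python model below is only an illustration under stated assumptions: it uses a plain, unbalanced binary search tree (red-black rebalancing is omitted entirely), and the names `CachedRoot`, `insert_cached`, `first_cached` and `erase_leftmost` are hypothetical stand-ins, not the kernel's C interfaces. It shows the two points the message makes: the insert path learns for free whether the new node is the new leftmost, and removal of the cached node must advance the cache to its successor first.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None


class CachedRoot:
    """Tree root plus a cached pointer to the leftmost (smallest) node."""

    def __init__(self):
        self.root = None
        self.leftmost = None


def first_cached(tree):
    # O(1): no walk down the left spine is needed.
    return tree.leftmost


def insert_cached(tree, node):
    # Descend as in any BST insert; the node is the new leftmost only if
    # the descent never turned right (duplicates go right).
    parent, cur, leftmost = None, tree.root, True
    while cur is not None:
        parent = cur
        if node.key < cur.key:
            cur = cur.left
        else:
            cur = cur.right
            leftmost = False
    node.parent = parent
    if parent is None:
        tree.root = node
    elif node.key < parent.key:
        parent.left = node
    else:
        parent.right = node
    if leftmost:
        tree.leftmost = node


def next_node(node):
    # In-order successor, or None if node is the largest.
    if node.right is not None:
        node = node.right
        while node.left is not None:
            node = node.left
        return node
    while node.parent is not None and node is node.parent.right:
        node = node.parent
    return node.parent


def erase_leftmost(tree):
    # Simplified erase: only the cached leftmost node can be removed here.
    node = tree.leftmost
    if node is None:
        return None
    tree.leftmost = next_node(node)  # advance the cache before unlinking
    child = node.right               # the leftmost node has no left child
    if child is not None:
        child.parent = node.parent
    if node.parent is None:
        tree.root = child
    else:
        node.parent.left = child     # leftmost is always a left child
    return node
```

A caller that repeatedly takes the smallest element (as a scheduler-style user would) calls `first_cached()` or `erase_leftmost()` and never pays for a descent to find it.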
      
      Link: http://lkml.kernel.org/r/20170719014603.19029-2-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hmm: heterogeneous memory management documentation · bffc33ec
      Jérôme Glisse committed
      Patch series "HMM (Heterogeneous Memory Management)", v25.
      
      Heterogeneous Memory Management (HMM) (description and justification)
      
      Today device drivers expose a dedicated memory allocation API through
      their device file, often relying on a combination of IOCTL and mmap
      calls.  The device can only access and use memory allocated through
      this API.  This effectively splits the program address space into
      objects allocated for the device and usable by the device, and other
      regular memory (malloc, mmap of a file, shared memory, ...) accessible
      only by the CPU (or in a very limited way by a device by pinning
      memory).
      
      Allowing different isolated components of a program to use a device
      thus requires duplicating the input data structures using the device
      memory allocator.  This is reasonable for simple data structures
      (array, grid, image, ...) but gets extremely complex with advanced
      data structures (list, tree, graph, ...) that rely on a web of memory
      pointers.  This is becoming a serious limitation on the kind of
      workload that can be offloaded to devices like GPUs.
      
      New industry standards like C++, OpenCL or CUDA are pushing to remove
      this barrier.  This requires a shared address space between the GPU
      device and the CPU so that the GPU can access any memory of a process
      (while still obeying memory protection like read only).  This kind of
      feature is also appearing in various other operating systems.
      
      HMM is a set of helpers to facilitate several aspects of address space
      sharing and device memory management.  Unlike existing sharing
      mechanisms that rely on pinning pages used by a device, HMM relies on
      mmu_notifier to propagate CPU page table updates to the device page
      table.
      
      Duplicating the CPU page table is only one aspect necessary for
      efficiently using a device like a GPU.  GPU local memory has bandwidth
      in the TeraBytes/second range, but it is connected to main memory
      through a system bus like PCIE that is limited to 32 GigaBytes/second
      (PCIE 4.0 16x).  Thus it is necessary to allow migration of process
      memory from main system memory to device memory.  The issue is that on
      platforms that only have PCIE, the device memory is not accessible by
      the CPU with the same properties as main memory (cache coherency,
      atomic operations, ...).
      
      To allow migration from main memory to device memory, HMM provides a
      set of helpers to hotplug device memory as a new type of ZONE_DEVICE
      memory which is un-addressable by the CPU but still has struct page
      representing it.  This allows most of the core kernel logic that deals
      with process memory to stay oblivious to the peculiarities of device
      memory.
      
      When a page backing an address of a process is migrated to device
      memory, the CPU page table entry is set to a new specific swap entry.
      CPU access to such an address triggers a migration back to system
      memory, just as if the page had been swapped to disk.  HMM also blocks
      anyone from pinning a ZONE_DEVICE page so that it can always be
      migrated back to system memory if the CPU accesses it.  Conversely,
      HMM does not migrate to device memory any page that is pinned in
      system memory.
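
The migration and pinning rules in the paragraph above can be restated as a small toy model. This is purely illustrative: `ToyAddressSpace` and its methods are hypothetical names invented for this sketch, pages are just dictionary keys, and nothing here corresponds to an actual HMM or kernel interface. It only encodes the three stated rules: CPU access to a device-resident page migrates it back, device-resident pages cannot be pinned, and pinned pages are not migrated to the device.

```python
class ToyAddressSpace:
    """Toy model: every page lives in either 'system' or 'device' memory."""

    def __init__(self):
        self.location = {}        # page id -> "system" or "device"
        self.pinned = set()       # pages pinned in system memory
        self.migrations_back = 0  # count of CPU-fault-driven migrations

    def map_page(self, page):
        self.location[page] = "system"

    def pin(self, page):
        # Rule: nobody may pin a page that currently lives in device memory.
        if self.location[page] == "device":
            raise PermissionError("device-resident pages cannot be pinned")
        self.pinned.add(page)

    def migrate_to_device(self, page):
        # Rule: a page pinned in system memory is never migrated.
        if page in self.pinned:
            return False
        # In the real design the CPU page table entry would now hold a
        # special swap entry; here we only record the new location.
        self.location[page] = "device"
        return True

    def cpu_access(self, page):
        # Rule: CPU access to a device-resident page migrates it back,
        # much like faulting in a page that was swapped out.
        if self.location[page] == "device":
            self.location[page] = "system"
            self.migrations_back += 1
        return self.location[page]
```

Because pinning is refused for device-resident pages, `cpu_access()` can always bring a page back, which is the invariant the text relies on.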
      
      To allow efficient migration between device memory and main memory, a
      new migrate_vma() helper is added with this patchset.  It allows
      leveraging the device DMA engine to perform the copy operation.
      
      This feature will be used by upstream drivers like nouveau and mlx5,
      and probably others in the future (amdgpu is the next suspect in
      line).  We are actively working on nouveau and mlx5 support.  To test
      this patchset we also worked with the NVidia closed source driver
      team; they have more resources than us to test this kind of
      infrastructure and also a bigger and better userspace eco-system with
      various real industry workloads they can use to test and profile HMM.
      
      The expected workload is a program that builds a data set on the CPU
      (from disk, from network, from sensors, ...).  The program uses a GPU
      API (OpenCL, CUDA, ...) to give hints on memory placement for the
      input data and also for the output buffer.  The program calls the GPU
      API to schedule a GPU job; this happens using a device driver specific
      ioctl.  All this is hidden from the programmer's point of view in the
      case of a C++ compiler that transparently offloads some part of a
      program to the GPU.  The program can keep doing other stuff on the CPU
      while the GPU is crunching numbers.
      
      It is expected that the CPU will not access the same data set as the
      GPU while the GPU is working on it, but this is not mandatory.  In
      fact we expect some small memory objects to be actively accessed by
      both GPU and CPU concurrently as a synchronization channel and/or for
      monitoring purposes.  Such objects will stay in system memory and
      should not be bottlenecked by system bus bandwidth (rare write and
      read access from both CPU and GPU).
      
      As we are relying on the device driver API, HMM does not introduce any
      new syscall nor does it modify any existing ones.  It does not change
      any POSIX semantics or behaviors.  For instance the child after a fork
      of a process that is using HMM will not be impacted in any way, nor is
      there any data hazard between child COW or parent COW of memory that
      was migrated to device prior to fork.
      
      HMM assumes a number of hardware features.  The device must allow the
      device page table to be updated at any time (i.e. device jobs must be
      preemptable).  The device page table must provide memory protection
      such as read only.  The device must track write access (dirty bit).
      The device must have a minimum granularity that matches PAGE_SIZE
      (i.e. 4k).
      
      Reviewer (just hint):
      Patch 1  HMM documentation
      Patch 2  introduce core infrastructure and definition of HMM, pretty
               small patch and easy to review
      Patch 3  introduce the mirror functionality of HMM, it relies on
               mmu_notifier and thus someone familiar with that part would be
               in better position to review
      Patch 4  is a helper to snapshot the CPU page table while synchronizing
               with concurrent page table updates. Understanding mmu_notifier
               makes review easier.
      Patch 5  is mostly a wrapper around handle_mm_fault()
      Patch 6  add new add_pages() helper to avoid modifying each arch memory
               hot plug function
      Patch 7  add a new memory type for ZONE_DEVICE and also add all the logic
               in various core mm to support this new type. Dan Williams and
               any core mm contributor are the best people to review each half
               of this patchset
      Patch 8  special case HMM ZONE_DEVICE pages inside put_page(). Kirill and
               Dan Williams are the best people to review this
      Patch 9  allow to uncharge a page from memory group without using the lru
               list field of struct page (best reviewer: Johannes Weiner or
               Vladimir Davydov or Michal Hocko)
      Patch 10 Add support to uncharge ZONE_DEVICE page from a memory cgroup (best
               reviewer: Johannes Weiner or Vladimir Davydov or Michal Hocko)
      Patch 11 add helper to hotplug un-addressable device memory as new type
               of ZONE_DEVICE memory (new type introduced in patch 3 of this
               series). This is boilerplate code around memory hotplug and it
               also picks a free range of physical addresses for the device
               memory. Note that the physical addresses do not point to anything
               (at least as far as the kernel knows).
      Patch 12 introduce a new hmm_device class as a helper for device drivers
               that want to expose multiple device memories under a common fake
               device driver. This is useful for multi-GPU configurations.
               Anyone familiar with device driver infrastructure can review
               this. Boilerplate code really.
      Patch 13 add a new migrate mode. Anyone familiar with page migration is
               welcome to review.
      Patch 14 introduce a new migration helper (migrate_vma()) that allows
               migrating a range of virtual addresses of a process using the
               device DMA engine to perform the copy. It is not limited to
               copying from and to a device but can also copy between any kind
               of source and destination memory. Again anyone familiar with
               migration code should be able to verify the logic.
      Patch 15 optimize the new migrate_vma() by unmapping pages while we are
               collecting them. This can be reviewed by any mm folks.
      Patch 16 add unaddressable memory migration to helper introduced in patch
               7, this can be reviewed by anyone familiar with migration code
      Patch 17 add a feature that allows a device to allocate non-present pages
               on the GPU when migrating a range of addresses to device memory.
               This is a helper for device drivers to avoid having to first
               allocate system memory before migration to device memory
      Patch 18 add a new kind of ZONE_DEVICE memory for cache coherent device
               memory (CDM)
      Patch 19 add a helper to hotplug CDM memory
      
      Previous patchset posting :
      v1 http://lwn.net/Articles/597289/
      v2 https://lkml.org/lkml/2014/6/12/559
      v3 https://lkml.org/lkml/2014/6/13/633
      v4 https://lkml.org/lkml/2014/8/29/423
      v5 https://lkml.org/lkml/2014/11/3/759
      v6 http://lwn.net/Articles/619737/
      v7 http://lwn.net/Articles/627316/
      v8 https://lwn.net/Articles/645515/
      v9 https://lwn.net/Articles/651553/
      v10 https://lwn.net/Articles/654430/
      v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
      v12 http://www.kernelhub.org/?msg=972982&p=2
      v13 https://lwn.net/Articles/706856/
      v14 https://lkml.org/lkml/2016/12/8/344
      v15 http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1304107.html
      v16 http://www.spinics.net/lists/linux-mm/msg119814.html
      v17 https://lkml.org/lkml/2017/1/27/847
      v18 https://lkml.org/lkml/2017/3/16/596
      v19 https://lkml.org/lkml/2017/4/5/831
      v20 https://lwn.net/Articles/720715/
      v21 https://lkml.org/lkml/2017/4/24/747
      v22 http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05176.html
      v23 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1404788.html
      v24 https://lwn.net/Articles/726691/
      
      This patch (of 19):
      
      This adds documentation for HMM (Heterogeneous Memory Management).  It
      presents the motivation behind it, the features necessary for it to be
      useful, and gives an overview of how this is implemented.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-2-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kokr/memory-barriers.txt: Apply atomic_t.txt change · 6fad4e69
      SeongJae Park committed
      This commit applies memory-barriers.txt part of upstream change, commit
      706eeb3e ("Documentation/locking/atomic: Add documents for new
      atomic_t APIs") to Korean translation.
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
    • kokr/doc: Update memory-barriers.txt for read-to-write dependencies · 53e31538
      SeongJae Park committed
      This commit applies upstream change, commit 66ce3a4d ("doc: Update
      memory-barriers.txt for read-to-write dependencies") to Korean
      translation.
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
    • docs-rst: don't require adjustbox anymore · 54d6d73f
      Mauro Carvalho Chehab committed
      Only the media PDF book was requiring adjustbox, in order to
      scale big tables. That worked pretty well with Sphinx versions
      1.4 and 1.5, but Sphinx 1.6 changed the way tables are produced,
      by introducing some weird macros before tabulary.
      That causes adjustbox to fail. So, it can't be used anymore,
      and its usage was removed from the media book.
      
      So, let's remove it from conf.py and sphinx-pre-install.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
    • docs-rst: conf.py: only setup notice box colors if Sphinx < 1.6 · 9fdcd6af
      Mauro Carvalho Chehab committed
      Sphinx 1.5 added a new way to change background colors for note
      boxes, but kept backward compatibility with 1.4. On Sphinx 1.6,
      the old way stopped working, in favor of a new less hackish
      way.
      
      Unfortunately, this is currently too buggy to be used, and
      the old way doesn't work anymore. So, we have no option but
      to stick with boring notice boxes.
      
      One example of such a bug is the notice inside
      struct v4l2_plane, at the "bytesused" field.
      
      At least, add a note about how to use it, as maybe some day
      the bug will vanish.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
    • docs-rst: conf.py: remove lscape from LaTeX preamble · c4b326e1
      Mauro Carvalho Chehab committed
      Only the media book used this extension in the past, but
      it is not required anymore.
      
      Cleanup patch only.
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
  6. 07 Sep, 2017 8 commits
    • dt-binding: phy: don't confuse with Ethernet phy properties · 9a94b3a4
      Baruch Siach committed
      The generic PHY 'phys' property sometimes appears in the same node with
      the Ethernet PHY 'phy' or 'phy-handle' properties. Add a warning in
      phy-bindings.txt to reduce confusion.
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mm: add /proc/pid/smaps_rollup · 493b0e9d
      Daniel Colascione committed
      /proc/pid/smaps_rollup is a new proc file that improves the performance
      of user programs that determine aggregate memory statistics (e.g., total
      PSS) of a process.
      
      Android regularly "samples" the memory usage of various processes in
      order to balance its memory pool sizes.  This sampling process involves
      opening /proc/pid/smaps and summing certain fields.  For very large
      processes, sampling memory use this way can take several hundred
      milliseconds, due mostly to the overhead of the seq_printf calls in
      task_mmu.c.
      
      smaps_rollup improves the situation.  It contains most of the fields of
      /proc/pid/smaps, but instead of a set of fields for each VMA,
      smaps_rollup contains one synthetic smaps-format entry
      representing the whole process.  In the single smaps_rollup synthetic
      entry, each field is the summation of the corresponding field in all of
      the real-smaps VMAs.  Using a common format for smaps_rollup and smaps
      allows userspace parsers to repurpose parsers meant for use with
      non-rollup smaps for smaps_rollup, and it allows userspace to switch
      between smaps_rollup and smaps at runtime (say, based on the
      availability of smaps_rollup in a given kernel) with minimal fuss.
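
To illustrate the point about a shared format, here is a minimal userspace sketch in Python. It assumes only what the text states, namely that both files carry `Name:   <value> kB` fields, and it is not code from the patch: `sum_smaps_fields` and `read_process_totals` are hypothetical helper names. The same summing parser handles either file, so falling back from smaps_rollup to smaps needs no second parser.

```python
import re
from collections import Counter

# Matches field lines such as "Pss:                 123 kB".
# VMA header lines and non-kB lines (e.g. VmFlags) do not match.
_FIELD = re.compile(r"^(\w+):\s+(\d+)\s+kB\s*$")


def sum_smaps_fields(text):
    """Sum every kB-valued field over all entries in smaps-format text."""
    totals = Counter()
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            totals[match.group(1)] += int(match.group(2))
    return dict(totals)


def read_process_totals(pid):
    """Prefer smaps_rollup; fall back to smaps on kernels without it."""
    for name in ("smaps_rollup", "smaps"):
        try:
            with open(f"/proc/{pid}/{name}") as handle:
                return sum_smaps_fields(handle.read())
        except OSError:
            continue  # file missing or unreadable: try the next one
    return {}
```

With smaps_rollup the loop body runs over one synthetic entry instead of one entry per VMA, which is where the saving described in the message comes from; the result dictionary has the same shape either way.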
      
      By using smaps_rollup instead of smaps, a caller can avoid the
      significant overhead of formatting, reading, and parsing each of a large
      process's potentially very numerous memory mappings.  For sampling
      system_server's PSS in Android, we measured a 12x speedup, representing
      a savings of several hundred milliseconds.
      
      One alternative to a new per-process proc file would have been including
      PSS information in /proc/pid/status.  We considered this option but
      thought that PSS would be too expensive (by a few orders of magnitude)
      to collect relative to what's already emitted as part of
      /proc/pid/status, and slowing every user of /proc/pid/status for the
      sake of readers that happen to want PSS feels wrong.
      
      The code itself works by reusing the existing VMA-walking framework we
      use for regular smaps generation and keeping the mem_size_stats
      structure around between VMA walks instead of using a fresh one for each
      VMA.  In this way, summation happens automatically.  We let seq_file
      walk over the VMAs just as it does for regular smaps and just emit
      nothing to the seq_file until we hit the last VMA.
      
      Benchmarks:
      
          using smaps:
          iterations:1000 pid:1163 pss:220023808
          0m29.46s real 0m08.28s user 0m20.98s system
      
          using smaps_rollup:
          iterations:1000 pid:1163 pss:220702720
          0m04.39s real 0m00.03s user 0m04.31s system
      
      We're using the PSS samples we collect asynchronously for
      system-management tasks like fine-tuning oom_adj_score, memory use
      tracking for debugging, application-level memory-use attribution, and
      deciding whether we want to kill large processes during system idle
      maintenance windows.  Android has been using PSS for these purposes for
      a long time; as the average process VMA count has increased and
      devices become more efficiency-conscious, PSS-collection inefficiency
      has started to matter more.  IMHO, it'd be a lot safer to optimize the
      existing PSS-collection model, which has been fine-tuned over the years,
      instead of changing the memory tracking approach entirely to work around
      smaps-generation inefficiency.
      
      Tim said:
      
      : There are two main reasons why Android gathers PSS information:
      :
      : 1. Android devices can show the user the amount of memory used per
      :    application via the settings app.  This is a less important use case.
      :
      : 2. We log PSS to help identify leaks in applications.  We have found
      :    an enormous number of bugs (in the Android platform, in Google's own
      :    apps, and in third-party applications) using this data.
      :
      : To do this, system_server (the main process in Android userspace) will
      : sample the PSS of a process three seconds after it changes state (for
      : example, app is launched and becomes the foreground application) and about
      : every ten minutes after that.  The net result is that PSS collection is
      : regularly running on at least one process in the system (usually a few
      : times a minute while the screen is on, less when screen is off due to
      : suspend).  PSS of a process is an incredibly useful stat to track, and we
      : aren't going to get rid of it.  We've looked at some very hacky approaches
      : using RSS ("take the RSS of the target process, subtract the RSS of the
      : zygote process that is the parent of all Android apps") to reduce the
      : accounting time, but it regularly overestimated the memory used by 20+
      : percent.  Accordingly, I don't think that there's a good alternative to
      : using PSS.
      :
      : We started looking into PSS collection performance after we noticed random
      : frequency spikes while a phone's screen was off; occasionally, one of the
      : CPU clusters would ramp to a high frequency because there was 200-300ms of
      : constant CPU work from a single thread in the main Android userspace
      : process.  The work causing the spike (which is reasonable governor
      : behavior given the amount of CPU time needed) was always PSS collection.
      : As a result, Android is burning more power than we should be on PSS
      : collection.
      :
      : The other issue (and why I'm less sure about improving smaps as a
      : long-term solution) is that the number of VMAs per process has increased
      : significantly from release to release.  After trying to figure out why we
      : were seeing these 200-300ms PSS collection times on Android O but had not
      : noticed it in previous versions, we found that the number of VMAs in the
      : main system process increased by 50% from Android N to Android O (from
      : ~1800 to ~2700) and varying increases in every userspace process.  Android
      : M to N also had an increase in the number of VMAs, although not as much.
      : I'm not sure why this is increasing so much over time, but thinking about
      : ASLR and ways to make ASLR better, I expect that this will continue to
      : increase going forward.  I would not be surprised if we hit 5000 VMAs on
      : the main Android process (system_server) by 2020.
      :
      : If we assume that the number of VMAs is going to increase over time, then
      : doing anything we can do to reduce the overhead of each VMA during PSS
      : collection seems like the right way to go, and that means outputting an
      : aggregate statistic (to avoid whatever overhead there is per line in
      : writing smaps and in reading each line from userspace).
      
      Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
      Signed-off-by: Daniel Colascione <dancol@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: choose swap device according to numa node · a2468cc9
      Aaron Lu committed
      If the system has more than one swap device and the swap devices have
      node information, we can make use of this information to decide which
      swap device to use in get_swap_pages() to get better performance.
      
      The current code uses a priority based list, swap_avail_list, to decide
      which swap device to use and if multiple swap devices share the same
      priority, they are used round robin.  This patch changes the previous
      single global swap_avail_list into a per-numa-node list, i.e.  for each
      numa node, it sees its own priority based list of available swap
      devices.  Swap device's priority can be promoted on its matching node's
      swap_avail_list.
      
      The current swap device's priority is set as: user can set a >=0 value,
      or the system will pick one starting from -1 then downwards.  The
      priority value in the swap_avail_list is the negated value of the swap
      device's due to plist being sorted from low to high.  The new policy
      doesn't change the semantics for priority >=0 cases, the previous
      starting from -1 then downwards now becomes starting from -2 then
      downwards and -1 is reserved as the promoted value.
      
      Take a 4-node EX machine as an example: suppose 4 swap devices are
      available, each sitting on a different node:
      swapA on node 0
      swapB on node 1
      swapC on node 2
      swapD on node 3
      
      Suppose they are all swapped on in the sequence ABCD.
      
      Current behaviour:
      their priorities will be:
      swapA: -1
      swapB: -2
      swapC: -3
      swapD: -4
      And their position in the global swap_avail_list will be:
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:2     prio:3     prio:4
      
      New behaviour:
      their priorities will be (note that -1 is skipped):
      swapA: -2
      swapB: -3
      swapC: -4
      swapD: -5
      And their positions in the 4 swap_avail_lists[nid] will be:
      swap_avail_lists[0]: /* node 0's available swap device list */
      swapA   -> swapB   -> swapC   -> swapD
      prio:1     prio:3     prio:4     prio:5
      swap_avail_lists[1]: /* node 1's available swap device list */
      swapB   -> swapA   -> swapC   -> swapD
      prio:1     prio:2     prio:4     prio:5
      swap_avail_lists[2]: /* node 2's available swap device list */
      swapC   -> swapA   -> swapB   -> swapD
      prio:1     prio:2     prio:3     prio:5
      swap_avail_lists[3]: /* node 3's available swap device list */
      swapD   -> swapA   -> swapB   -> swapC
      prio:1     prio:2     prio:3     prio:4
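
The per-node ordering in the example above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, assuming every device gets an auto-assigned priority (none set by the user) and that list order follows ascending plist priority; `build_avail_lists` is a hypothetical name, not a kernel function.

```python
def build_avail_lists(devices, num_nodes):
    """devices: list of (name, node) tuples in swapon order.

    Returns one list per node of (plist_prio, name), sorted the way a
    low-to-high plist would order them.
    """
    # Auto-assigned priorities now start at -2 and go downwards;
    # -1 is reserved as the promoted value.
    priority = {name: -(index + 2) for index, (name, _) in enumerate(devices)}

    lists = []
    for nid in range(num_nodes):
        entries = []
        for name, node in devices:
            # On its own node a device is promoted to priority -1.
            effective = -1 if node == nid else priority[name]
            # The plist stores the negated priority (sorted low to high).
            entries.append((-effective, name))
        entries.sort()
        lists.append(entries)
    return lists


if __name__ == "__main__":
    devs = [("swapA", 0), ("swapB", 1), ("swapC", 2), ("swapD", 3)]
    for nid, entries in enumerate(build_avail_lists(devs, 4)):
        print(nid, entries)
```

For the four-device example this yields, for node 1, swapB at 1 followed by swapA at 2, swapC at 4 and swapD at 5, matching the listing in the commit message.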
      
      To see the effect of the patch, we use a test that starts N processes;
      each mmaps a region of anonymous memory and then continually writes to
      it at random positions to trigger both swap in and swap out.
      
      On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
      are used as swap devices, each attached to a different node.  The
      result is:
      
      runtime=30m/processes=32/total test size=128G/each process mmap region=4G
      kernel         throughput
      vanilla        13306
      auto-binding   15169 +14%
      
      runtime=30m/processes=64/total test size=128G/each process mmap region=2G
      kernel         throughput
      vanilla        11885
      auto-binding   14879 +25%
      
      [aaron.lu@intel.com: v2]
        Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
        Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
      [akpm@linux-foundation.org: use kmalloc_array()]
      Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
      Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Cc: "Chen, Tim C" <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2468cc9
    • H
      mm, swap: add sysfs interface for VMA based swap readahead · d9bfcfdc
      Huang Ying committed
      The sysfs interface to control the VMA based swap readahead is added as
      follows:
      
      /sys/kernel/mm/swap/vma_ra_enabled
      
      Enable the VMA based swap readahead algorithm, or use the original
      global swap readahead algorithm.
      
      /sys/kernel/mm/swap/vma_ra_max_order
      
      Set the max order of the readahead window size for the VMA based swap
      readahead algorithm.
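
      As an illustration only, the two files can be read like any other
      sysfs attribute. The sketch below returns the raw strings and makes
      no claim about the exact value format, which is described in the ABI
      documentation; it assumes only that an order n corresponds to a
      window of 2^n pages. The helper names are hypothetical.

```python
from pathlib import Path

SWAP_SYSFS_DIR = Path("/sys/kernel/mm/swap")


def read_vma_ra_settings(base=SWAP_SYSFS_DIR):
    """Return the raw contents of the two knobs, or None where a file is absent."""
    base = Path(base)
    settings = {}
    for name in ("vma_ra_enabled", "vma_ra_max_order"):
        path = base / name
        settings[name] = path.read_text().strip() if path.exists() else None
    return settings


def window_pages(max_order):
    """Pages in a readahead window of the given order (2 ** order)."""
    return 1 << int(max_order)


if __name__ == "__main__":
    print(read_vma_ra_settings())
```

      On a kernel without this patch both entries simply come back as
      None.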
      
      The corresponding ABI documentation is added too.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9bfcfdc
    • J
      fscache: remove unused ->now_uncached callback · 26b433d0
      Jan Kara committed
      Patch series "Ranged pagevec lookup", v2.
      
      In this series I make pagevec_lookup() update the index (to be
      consistent with pagevec_lookup_tag() and also as a preparation for
      ranged lookups), provide a ranged variant of pagevec_lookup() and use it
      in places where it makes sense.  This not only removes some common code
      but is also a measurable performance win for some use cases (see patch
      4/10) where the radix tree is sparse and searching for and grabbing a
      page after the end of the range has measurable overhead.
      
      This patch (of 10):
      
      The callback doesn't ever get called.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26b433d0
    • M
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE · c9bff3ee
      Michal Hocko committed
      Patch series "cleanup zonelists initialization", v1.
      
      This is aimed at cleaning up the zonelists initialization code we have
      but the primary motivation was bug report [2] which got resolved but the
      usage of stop_machine is just too ugly to live.  Most patches are
      straightforward but 3 of them need a special consideration.
      
      Patch 1 removes zone ordered zonelists completely.  I am CCing linux-api
      because this is a user visible change.  As I argue in the patch
      description I do not think we have a strong usecase for it these days.
      I have kept sysctl in place and warn into the log if somebody tries to
      configure zone lists ordering.  If somebody has a real usecase for it we
      can revert this patch but I do not expect anybody will actually notice
      runtime differences.  This patch is not strictly needed for the rest but
      it made patch 6 easier to implement.
      
      Patch 7 removes stop_machine from build_all_zonelists without adding any
      special synchronization between iterators and updater which I _believe_
      is acceptable as explained in the changelog.  I hope I am not missing
      anything.
      
      Patch 8 then removes zonelists_mutex which is kind of ugly as well and
      not really needed AFAICS, but care should be taken when double checking
      my thinking.
      
      This patch (of 9):
      
      Supporting zone ordered zonelists costs us just a lot of code while the
      usefulness is arguable if existent at all.  Mel has already made node
      ordering default on 64b systems.  32b systems are still using
      ZONELIST_ORDER_ZONE because it is considered better to fall back to a
      different NUMA node rather than consume precious lowmem zones.
      
      This argument is, however, weakened by the fact that the memory reclaim
      has been reworked to be node rather than zone oriented.  This means that
      lowmem requests have to skip over all highmem pages on LRUs already and
      so zone ordering doesn't save the reclaim time much.  So the only
      advantage of the zone ordering is under a light memory pressure when
      highmem requests do not ever hit into lowmem zones and the lowmem
      pressure doesn't need to reclaim.
      
      Considering that 32b NUMA systems are rather suboptimal already and it
      is generally advisable to use 64b kernel on such a HW I believe we
      should rather care about the code maintainability and just get rid of
      ZONELIST_ORDER_ZONE altogether.  Keep sysctl in place and warn if
      somebody tries to set zone ordering either from kernel command line or
      the sysctl.
      
      [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate]
      Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9bff3ee
    • M
      zram: add config and doc file for writeback feature · 5a47074f
      Minchan Kim committed
      This patch adds documentation and a Kconfig option for the writeback
      feature.
      
      Link: http://lkml.kernel.org/r/1498459987-24562-10-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a47074f
    • R
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Ross Zwisler committed
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
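
      For readers unfamiliar with the access pattern in question, the
      sketch below maps a sparse file and reads it back through the
      mapping. It runs on an ordinary temporary file rather than a DAX
      mount, so it only illustrates what "mmap() reads from file holes"
      means from user space (every byte reads as zero); it does not
      exercise the DAX fault path, and the function name is hypothetical.

```python
import mmap
import tempfile


def read_hole_via_mmap(size=1 << 20):
    """Create a file that is one large hole and read it through mmap()."""
    with tempfile.TemporaryFile() as f:
        # Extending the file without writing data leaves a hole on
        # filesystems that support sparse files.
        f.truncate(size)
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ) as m:
            data = m[:]  # copying the bytes out faults in each page
    return data


if __name__ == "__main__":
    data = read_hole_via_mmap()
    print(len(data), data.count(0) == len(data))
```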
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
  7. 05 Sep, 2017 11 commits