  1. 15 March 2022: 2 commits
  2. 20 January 2022: 1 commit
  3. 19 October 2021: 1 commit
  4. 18 October 2021: 3 commits
  5. 27 September 2021: 1 commit
  6. 03 March 2021: 1 commit
  7. 25 February 2021: 1 commit
  8. 28 January 2021: 1 commit
  9. 25 January 2021: 1 commit
  10. 03 December 2020: 1 commit
    • mm: memcontrol: Use helpers to read page's memcg data · bcfe06bf
      Authored by Roman Gushchin
      Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
      
      Currently a non-slab kernel page which has been charged to a memory cgroup
      can't be mapped to userspace.  The underlying reason is simple: the
      PageKmemcg flag is defined as a page type (like buddy, offline, etc.), so
      it takes a bit from the page->_mapcount counter.  Pages with a type set
      can't be mapped to userspace.
      
      But in general the kmemcg flag has nothing to do with mapping to
      userspace.  It only means that the page has been accounted by the page
      allocator, so it has to be properly uncharged on release.
      
      Some bpf maps map vmalloc-based memory to userspace, and their memory
      can't be accounted because of this implementation detail.
      
      This patchset removes this limitation by moving the PageKmemcg flag into
      one of the free bits of the page->mem_cgroup pointer.  It also formalizes
      accesses to page->mem_cgroup and page->obj_cgroups using new helpers, adds
      several checks and removes a couple of obsolete functions.  As a result
      the code becomes more robust, with fewer open-coded bit tricks.
      
      This patch (of 4):
      
      Currently there are many open-coded reads of the page->mem_cgroup pointer,
      as well as a couple of read helpers, which are barely used.
      
      This creates an obstacle to reusing some bits of the pointer for storing
      additional information.  In fact, we already do this for slab pages, where
      the lowest bit indicates that the pointer refers to an attached vector of
      objcg pointers instead of a regular memcg pointer.
      
      This commit uses two existing helpers and introduces a new one, converting
      all read sides to calls of these helpers:
        struct mem_cgroup *page_memcg(struct page *page);
        struct mem_cgroup *page_memcg_rcu(struct page *page);
        struct mem_cgroup *page_memcg_check(struct page *page);
      
      page_memcg_check() is intended for cases where the page can be a slab page
      whose memcg pointer points at an objcg vector.  It checks the lowest bit
      and, if set, returns NULL.  page_memcg() contains a VM_BUG_ON_PAGE() check
      that the page is not a slab page.
      
      To make sure nobody uses direct access, struct page's
      mem_cgroup/obj_cgroups field is converted to an unsigned long memcg_data.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
      Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
      Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
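      A minimal userspace sketch of the low-bit trick behind these helpers; the
      struct layouts and the MEMCG_DATA_OBJCGS flag below are simplified
      stand-ins for illustration, not the kernel definitions:

      #include <assert.h>
      #include <stdio.h>

      struct mem_cgroup { const char *name; };

      /* Lowest bit of memcg_data: set when it holds an objcg vector. */
      #define MEMCG_DATA_OBJCGS 0x1UL

      struct page {
              unsigned long memcg_data;  /* encodes a memcg or objcg-vector pointer */
              int is_slab;
      };

      /* page_memcg(): the caller guarantees this is not a slab page. */
      static struct mem_cgroup *page_memcg(struct page *page)
      {
              assert(!page->is_slab);    /* stands in for VM_BUG_ON_PAGE() */
              return (struct mem_cgroup *)page->memcg_data;
      }

      /* page_memcg_check(): the page may be a slab page; if the lowest bit
       * is set, memcg_data holds an objcg vector, so NULL is returned. */
      static struct mem_cgroup *page_memcg_check(struct page *page)
      {
              unsigned long data = page->memcg_data;

              if (data & MEMCG_DATA_OBJCGS)
                      return NULL;
              return (struct mem_cgroup *)data;
      }

      int main(void)
      {
              static struct mem_cgroup cg = { "demo-cgroup" };
              struct page p = { .memcg_data = (unsigned long)&cg, .is_slab = 0 };

              printf("memcg: %s\n", page_memcg(&p)->name);
              printf("check: %s\n", page_memcg_check(&p) ? "memcg" : "objcg vector");
              return 0;
      }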
  11. 14 October 2020: 2 commits
  12. 25 September 2020: 1 commit
  13. 04 September 2020: 1 commit
    • mm: Add arch hooks for saving/restoring tags · 8a84802e
      Authored by Steven Price
      Arm's Memory Tagging Extension (MTE) adds some metadata (tags) to every
      physical page.  When swapping pages out to disk it is necessary to save
      these tags, and later restore them when the pages are read back.
      
      Add some hooks along with dummy implementations to enable the
      arch code to handle this.
      
      Three new hooks are added to the swap code:
       * arch_prepare_to_swap()
       * arch_swap_invalidate_page() / arch_swap_invalidate_area()
      One new hook is added to shmem:
       * arch_swap_restore()
      Signed-off-by: Steven Price <steven.price@arm.com>
      [catalin.marinas@arm.com: add unlock_page() on the error path]
      [catalin.marinas@arm.com: dropped the _tags suffix]
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
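      A sketch of what the dummy (no-op) defaults for these hooks might look
      like; the stand-in types and the guard macro name are assumptions for
      illustration, not the actual kernel wiring:

      /* Stand-in types so the sketch is self-contained; in the kernel these
       * come from <linux/mm_types.h> and <linux/swap.h>. */
      struct page;
      typedef unsigned long pgoff_t;
      typedef struct { unsigned long val; } swp_entry_t;

      /* Hypothetical no-op defaults an architecture (e.g. arm64 with MTE)
       * can override to save and restore per-page tags. */
      #ifndef __HAVE_ARCH_SWAP_TAG_HOOKS

      static inline int arch_prepare_to_swap(struct page *page)
      {
              return 0;                  /* nothing to save by default */
      }

      static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
      {
      }

      static inline void arch_swap_invalidate_area(int type)
      {
      }

      static inline void arch_swap_restore(swp_entry_t entry, struct page *page)
      {
      }

      #endif /* __HAVE_ARCH_SWAP_TAG_HOOKS */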
  14. 15 August 2020: 3 commits
  15. 08 August 2020: 1 commit
  16. 29 June 2020: 1 commit
  17. 10 June 2020: 1 commit
    • mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Authored by Mike Rapoport
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definitions of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical, and they typically boil
      down to, e.g.
      
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
      
      For architectures that really need a custom version, there is always the
      possibility of overriding the generic version with the usual ifdef magic.
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So there is no point in explicitly including <asm/pgtable.h>
      in files that already include <linux/mm.h>.
      
      The include statements in such cases are removed with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
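      The "ifdef magic" mentioned above amounts to guarding each generic
      accessor so an architecture can supply its own.  A small standalone
      sketch of the pattern, with stand-in layout constants (the PMD_SHIFT and
      PTRS_PER_PMD values are illustrative):

      #include <stdio.h>

      /* Stand-in layout constants; a real architecture derives these from
       * its page-table geometry. */
      #define PMD_SHIFT       21
      #define PTRS_PER_PMD    512

      /* The generic accessor is compiled only when the architecture has not
       * provided its own pmd_index(); the #define marks it as present. */
      #ifndef pmd_index
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      #define pmd_index pmd_index
      #endif

      int main(void)
      {
              /* With 2MB PMD entries, 0x40200000 falls into PMD slot 1. */
              printf("pmd_index(0x40200000) = %lu\n", pmd_index(0x40200000UL));
              return 0;
      }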
  18. 03 February 2020: 1 commit
    • fs: Enable bmap() function to properly return errors · 30460e1e
      Authored by Carlos Maiolino
      Currently, bmap() either returns the physical block number related to the
      requested file offset, or 0 when an error occurs or the requested offset
      maps into a hole.
      This patch makes the changes needed to enable bmap() to properly return
      errors: the return value is now used as an error code, and a pointer must
      be passed to bmap() to be filled with the mapped physical block.
      
      This changes the behavior of bmap() on return:
      
      - a negative value in case of error
      - zero on success, or when the mapping falls into a hole
      
      In case of a hole, *block will be zero too.
      
      Since this is a prep patch, for now the only error returned is -EINVAL,
      when ->bmap doesn't exist.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
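      A hedged, kernel-context sketch of a caller adapting to the new
      convention; the helper below and its logging are illustrative, and only
      the int return plus the sector_t pointer argument come from the change
      itself:

      #include <linux/fs.h>
      #include <linux/printk.h>

      /* Hypothetical caller: *block carries the logical block in and the
       * physical block out; the return value carries the error. */
      static int report_mapping(struct inode *inode, sector_t lblock)
      {
              sector_t pblock = lblock;
              int error;

              error = bmap(inode, &pblock);
              if (error)
                      return error;   /* currently only -EINVAL (no ->bmap) */

              if (pblock == 0)
                      pr_info("offset %llu maps into a hole\n",
                              (unsigned long long)lblock);
              else
                      pr_info("offset %llu -> physical block %llu\n",
                              (unsigned long long)lblock,
                              (unsigned long long)pblock);
              return 0;
      }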
  19. 02 December 2019: 1 commit
  20. 16 November 2019: 1 commit
  21. 13 July 2019: 1 commit
    • mm, swap: use rbtree for swap_extent · 4efaceb1
      Authored by Aaron Lu
      swap_extent is used to map a swap page offset to the backing device's
      block offset.  For a contiguous block range, one swap_extent is used, and
      all these swap_extents are managed in a linked list.
      
      These swap_extents are used by map_swap_entry() during swap's read and
      write path.  To find out the backing device's block offset for a page
      offset, the swap_extent list will be traversed linearly, with
      curr_swap_extent being used as a cache to speed up the search.
      
      This works well as long as the number of swap_extents is not huge or the
      number of processes accessing the swap device is small, but when the swap
      device has many extents and a number of processes access it concurrently,
      it can be a problem.  On one of our servers, the disk's remaining space is
      tight:
      
        $df -h
        Filesystem      Size  Used Avail Use% Mounted on
        ... ...
        /dev/nvme0n1p1  1.8T  1.3T  504G  72% /home/t4
      
      When creating an 80G swapfile there, there are as many as 84656 swap
      extents.  The end result is that the kernel spends about 30% of its time
      in map_swap_entry() and swap throughput is only 70MB/s.
      
      As a comparison, when I used a smaller swapfile, like a 4G one whose
      swap_extent count dropped to 2000, swap throughput is back to 400-500MB/s
      and map_swap_entry() is about 3%.
      
      One downside of using an rbtree for swap_extent is that 'struct rb_node'
      takes 24 bytes while 'struct list_head' takes 16 bytes, i.e. 8 bytes more
      for each swap_extent.  For a swapfile that has 80k swap_extents, that
      means 625KiB more memory consumed.
      
      Test:
      
      Since it's not possible to reboot that server, I cannot test this patch
      directly there.  Instead, I tested it on another server with an NVMe disk.
      
      I created a 20G swapfile on an NVMe backed XFS fs.  By default, the
      filesystem is quite clean and the created swapfile has only 2 extents.
      Testing vanilla and this patch shows no obvious performance difference
      when swapfile is not fragmented.
      
      To see the patch's effects, I used some tweaks to manually fragment the
      swapfile by breaking the extent at 1M boundary.  This made the swapfile
      have 20K extents.
      
        nr_task=4
        kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
        vanilla  165191           90.77%             171798          90.21%
        patched  858993 +420%      2.16%             715827 +317%     0.77%
      
        nr_task=8
        kernel   swapout(KB/s) map_swap_entry(perf)  swapin(KB/s) map_swap_entry(perf)
        vanilla  306783           92.19%             318145          87.76%
        patched  954437 +211%      2.35%            1073741 +237%     1.57%
      
      swapout: the throughput of swap out, in KB/s, higher is better
      1st map_swap_entry: cpu cycles percent sampled by perf
      swapin: the throughput of swap in, in KB/s, higher is better
      2nd map_swap_entry: cpu cycles percent sampled by perf
      
      nr_task=1 doesn't show any difference; this is because curr_swap_extent
      can effectively cache the correct swap extent for a single-task workload.
      
      [akpm@linux-foundation.org: s/BUG_ON(1)/BUG()/]
      Link: http://lkml.kernel.org/r/20190523142404.GA181@aaronlu
      Signed-off-by: Aaron Lu <ziqian.lzq@antfin.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
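      The replacement lookup is a standard rbtree descent keyed by the first
      page offset each extent covers; a simplified, kernel-context sketch
      (struct and function names are illustrative, not the final code):

      #include <linux/rbtree.h>
      #include <linux/types.h>

      /* Illustrative extent node: maps [start_page, start_page + nr_pages)
       * to blocks starting at start_block on the backing device. */
      struct swap_extent_node {
              struct rb_node rb_node;
              pgoff_t start_page;
              pgoff_t nr_pages;
              sector_t start_block;
      };

      static struct swap_extent_node *lookup_swap_extent(struct rb_root *root,
                                                         pgoff_t offset)
      {
              struct rb_node *rb = root->rb_node;
              struct swap_extent_node *se;

              while (rb) {
                      se = rb_entry(rb, struct swap_extent_node, rb_node);
                      if (offset < se->start_page)
                              rb = rb->rb_left;
                      else if (offset >= se->start_page + se->nr_pages)
                              rb = rb->rb_right;
                      else
                              return se;   /* offset falls inside this extent */
              }
              return NULL;                 /* not reached for valid offsets */
      }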
  22. 05 July 2019: 1 commit
  23. 29 June 2019: 1 commit
    • mm, swap: fix THP swap out · 1a5f439c
      Authored by Huang Ying
      The 0-Day test system reported some OOM regressions for several THP
      (Transparent Huge Page) swap test cases.  These regressions were bisected
      to 68614289 ("block: always define BIO_MAX_PAGES as 256").  In that
      commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled, so the
      bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out a
      THP.  That causes the OOM.
      
      As the patch description of 68614289 ("block: always define BIO_MAX_PAGES
      as 256") explains, THP swap should use a multi-page bvec to write the THP
      to swap space.  So the issue is fixed by doing exactly that in
      get_swap_bio().
      
      BTW: I remember checking the THP swap code when 68614289 ("block: always
      define BIO_MAX_PAGES as 256") was merged, and thinking that the THP swap
      code needn't be changed.  But apparently I was wrong; I should have done
      this at that time.
      
      Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
      Fixes: 68614289 ("block: always define BIO_MAX_PAGES as 256")
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
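      The gist of the fix is that a THP no longer needs one bio vector per
      subpage: a single multi-page bvec can cover the whole compound page, so
      the bio never has to be allocated with more vectors than BIO_MAX_PAGES
      allows.  A hedged sketch of the idea, not the literal patch
      (hpage_nr_pages() is the era-appropriate helper; the wrapper is
      illustrative):

      #include <linux/bio.h>
      #include <linux/huge_mm.h>
      #include <linux/mm.h>

      /* Illustrative helper: build a bio for a (possibly huge) page using a
       * single multi-page bvec instead of one vector per subpage. */
      static struct bio *swap_bio_for_page(gfp_t gfp_flags, struct page *page)
      {
              struct bio *bio;

              /* One vector is always within BIO_MAX_PAGES, so this cannot
               * fail for the "too many vectors" reason described above. */
              bio = bio_alloc(gfp_flags, 1);
              if (!bio)
                      return NULL;

              /* hpage_nr_pages() is 512 for a 2MB THP, 1 for a base page. */
              bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
              return bio;
      }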
  24. 05 January 2019: 1 commit
  25. 03 January 2019: 1 commit
    • block: don't use un-ordered __set_current_state(TASK_UNINTERRUPTIBLE) · 1ac5cd49
      Authored by Linus Torvalds
      This mostly reverts commit 849a3700 ("block: avoid ordered task
      state change for polled IO").  It was wrongly claiming that the ordering
      wasn't necessary.  The memory barrier _is_ necessary.
      
      If something is truly polling and not going to sleep, it's the whole
      state setting that is unnecessary, not the memory barrier.  Whenever you
      set your state to a sleeping state, you absolutely need the memory
      barrier.
      
      Note that sometimes the memory barrier can be elsewhere.  For example,
      the ordering might be provided by an external lock, or by setting the
      process state to sleeping before adding yourself to the wait queue list
      that is used for waking up (where the wait queue lock itself will
      guarantee that any wakeup will correctly see the sleeping state).
      
      But none of those cases were true here.
      
      NOTE! Some of the polling paths may indeed be able to drop the state
      setting entirely, at which point the memory barrier also goes away.
      
      (Also note that this doesn't revert the TASK_RUNNING cases: there is no
      race between a wakeup and setting the process state to TASK_RUNNING,
      since the end result doesn't depend on ordering).
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
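      The ordering being restored is the classic sleep/wakeup pattern: the
      state store must be ordered before the condition check, which
      set_current_state() guarantees and __set_current_state() does not.  A
      schematic sketch (io_has_completed() and the cookie plumbing are
      hypothetical):

      #include <linux/blk_types.h>
      #include <linux/sched.h>

      /* Hypothetical condition checked by the sleeper; the waker makes it
       * true and then calls wake_up_process() on this task. */
      extern bool io_has_completed(blk_qc_t cookie);

      static void wait_for_io(blk_qc_t cookie)
      {
              for (;;) {
                      /* Orders the state store before the condition check;
                       * a plain __set_current_state() would not. */
                      set_current_state(TASK_UNINTERRUPTIBLE);
                      if (io_has_completed(cookie))
                              break;
                      schedule();
              }
              /* Back to running: no wakeup race here, no barrier needed. */
              __set_current_state(TASK_RUNNING);
      }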
  26. 08 December 2018: 1 commit
  27. 26 November 2018: 1 commit
    • block: make blk_poll() take a parameter on whether to spin or not · 0a1b8b87
      Authored by Jens Axboe
      blk_poll() has always kept spinning until it found an IO. This is
      fine for SYNC polling, since we need to find one request we have
      pending, but in preparation for ASYNC polling it can be beneficial
      to just check if we have any entries available or not.
      
      Existing callers are converted to pass in 'spin == true', to retain
      the old behavior.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
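      A hedged usage sketch of the new parameter; the wrapper and its sync flag
      are illustrative, and only the three-argument blk_poll() signature comes
      from the change:

      #include <linux/blkdev.h>

      /* Illustrative wrapper around the new interface. */
      static int poll_once_or_spin(struct request_queue *q, blk_qc_t cookie,
                                   bool sync)
      {
              if (!sync)
                      return blk_poll(q, cookie, false);  /* just peek for entries */

              /* spin == true keeps the old behavior: poll until the request
               * we are waiting for is found. */
              return blk_poll(q, cookie, true);
      }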
  28. 19 November 2018: 1 commit
  29. 16 November 2018: 1 commit
  30. 02 November 2018: 1 commit
  31. 27 October 2018: 1 commit
  32. 24 October 2018: 1 commit
    • iov_iter: Separate type from direction and use accessor functions · aa563d7b
      Authored by David Howells
      In the iov_iter struct, separate the iterator type from the iterator
      direction and use accessor functions to access them in most places.
      
      Convert a bunch of places to use switch statements to access them rather
      than chains of bitwise-AND statements.  This makes it easier to add
      further iterator types.  It can also be more efficient: to implement a
      switch over small contiguous integers, the compiler can use roughly 50%
      fewer compare instructions than it needs for chains of bitwise-AND tests.
      
      Further, cease passing the iterator type into the iterator setup function;
      the setup function can set that itself.  Only the direction is required.
      Signed-off-by: David Howells <dhowells@redhat.com>
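      A hedged sketch of the accessor-plus-switch style the patch moves toward;
      the dispatch function itself is illustrative, while iov_iter_type() and
      the ITER_* constants are the ones this series introduces:

      #include <linux/uio.h>

      /* Illustrative dispatch on the iterator type via the accessor rather
       * than open-coded bitwise-AND tests on i->type. */
      static const char *iter_kind(const struct iov_iter *i)
      {
              switch (iov_iter_type(i)) {
              case ITER_IOVEC:
                      return "user iovec";
              case ITER_KVEC:
                      return "kernel kvec";
              case ITER_BVEC:
                      return "bio_vec";
              case ITER_PIPE:
                      return "pipe";
              default:
                      return "other";
              }
      }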
  33. 22 September 2018: 1 commit
  34. 09 July 2018: 1 commit