1. 09 Jun 2020 (1 commit)
  2. 06 Jun 2020 (2 commits)
    • firmware/dmi: Report DMI Bios & EC firmware release · f5152f4d
      Committed by Erwan Velu
      Some vendors, such as HPE or Dell, encode the release version of their BIOS
      in the "System BIOS {Major|Minor} Release" fields of Type 0.
      
      This information identifies which BIOS release is actually running.
      It can be used for quirks, debugging sessions or inventory tasks.
      
      A typical output for a Dell system running BIOS release 65.27 is:
      	[root@t1700 ~]# cat /sys/devices/virtual/dmi/id/bios_release
      	65.27
      	[root@t1700 ~]#
      
      Servers that have a BMC encode the release version of their firmware in the
       "Embedded Controller Firmware {Major|Minor} Release" fields of Type 0.
      
      This information identifies which BMC release is actually running.
      It can be used for quirks, debugging sessions or inventory tasks.
      
      A typical output for a Dell system running BMC release 3.75 is:
          [root@t1700 ~]# cat /sys/devices/virtual/dmi/id/ec_firmware_release
          3.75
          [root@t1700 ~]#
      Signed-off-by: Erwan Velu <e.velu@criteo.com>
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
    • dm bufio: introduce forget_buffer_locked · 33a18062
      Committed by Mikulas Patocka
      Introduce a function forget_buffer_locked that forgets a range of
      buffers. It is more efficient than calling forget_buffer in a loop.
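
      For illustration, a minimal sketch of the intent (dm_bufio_forget() is the
      existing per-block call; the range helper shown here is an assumed name for
      the new entry point, not necessarily the literal one):

      	/* Before: drop each buffer individually, re-taking the client
      	 * lock for every block. */
      	for (b = block; b < block + n_blocks; b++)
      		dm_bufio_forget(c, b);

      	/* After: forget the whole range in one call, resolved under a
      	 * single lock acquisition (helper name assumed). */
      	dm_bufio_forget_buffers(c, block, n_blocks);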
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  3. 05 Jun 2020 (21 commits)
  4. 04 Jun 2020 (16 commits)
    • afs: Add a tracepoint to track the lifetime of the afs_volume struct · cca37d45
      Committed by David Howells
      Add a tracepoint to track the lifetime of the afs_volume struct.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Implement client support for the YFSVL.GetCellName RPC op · c3e9f888
      Committed by David Howells
      Implement client support for the YFSVL.GetCellName RPC operation by which
      YFS permits the canonical cell name to be queried from a VL server.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Build an abstraction around an "operation" concept · e49c7b2f
      Committed by David Howells
      Turn the afs_operation struct into the main way that most fileserver
      operations are managed.  Various things are added to the struct, including
      the following:
      
       (1) All the parameters and results of the relevant operations are moved
           into it, removing corresponding fields from the afs_call struct.
           afs_call gets a pointer to the op.
      
       (2) The target volume is made the main focus of the operation, rather than
           the target vnode(s), and a number of op->vnode->volume references
           become op->volume instead.
      
       (3) Two vnode records are defined (op->file[]) for the vnode(s) involved
           in most operations.  The vnode record (struct afs_vnode_param)
           contains:
      
      	- The vnode pointer.
      
      	- The fid of the vnode to be included in the parameters or that was
                returned in the reply (eg. FS.MakeDir).
      
      	- The status and callback information that may be returned in the
           	  reply about the vnode.
      
      	- Callback break and data version tracking for detecting
      	  simultaneous third-party changes.
      
       (4) Pointers to dentries to be updated with new inodes.
      
       (5) An operations table pointer.  The table includes pointers to functions
           for issuing AFS and YFS-variant RPCs, handling the success and abort
           of an operation and handling post-I/O-lock local editing of a
           directory.
      
      To make this work, the following function restructuring is made:
      
       (A) The rotation loop that issues calls to fileservers that can be found
           in each function that wants to issue an RPC (such as afs_mkdir()) is
           extracted out into common code, in a new file called fs_operation.c.
      
       (B) The rotation loops, such as the one in afs_mkdir(), are replaced with
           a much smaller piece of code that allocates an operation, sets the
           parameters and then calls out to the common code to do the actual
           work (see the sketch at the end of this entry).
      
       (C) The code for handling the success and failure of an operation is
           moved into operation functions (as per (5) above) and these are called
           from the core code at appropriate times.
      
       (D) The code for getting the pseudo inodes used by the dynamic root is
           moved over into dynroot.c.
      
       (E) struct afs_iget_data is absorbed into the operation struct and
           afs_iget() expects to be given an op pointer and a vnode record.
      
       (F) Point (E) doesn't work for the root dir of a volume, but we know the
           FID in advance (it's always vnode 1, unique 1), so a separate inode
           getter, afs_root_iget(), is provided to special-case that.
      
       (G) The inode status init/update functions now also take an op and a vnode
           record.
      
       (H) The RPC marshalling functions now, for the most part, just take an
           afs_operation struct as their only argument.  All the data they need
           is held there.  The result delivery functions write their answers
           there as well.
      
       (I) The call is attached to the operation and then the operation core does
           the waiting.
      
      And then the new operation code is, for the moment, made to just initialise
      the operation, get the appropriate vnode I/O locks and do the same rotation
      loop as before.
      
      This lays the foundation for the following changes in the future:
      
       (*) Overhauling the rotation (again).
      
       (*) Support for asynchronous I/O, where the fileserver rotation must be
           done asynchronously also.
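
      As a rough illustration (condensed for this log; helper and field names are
      approximate rather than the literal patch), a caller now takes a shape like:

      	static int afs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
      	{
      		struct afs_operation *op;
      		struct afs_vnode *dvnode = AFS_FS_I(dir);

      		/* Allocate the operation and point it at the target volume. */
      		op = afs_alloc_operation(NULL, dvnode->volume);
      		if (IS_ERR(op))
      			return PTR_ERR(op);

      		/* Fill in the parameters and the operations table... */
      		afs_op_set_vnode(op, 0, dvnode);
      		op->dentry      = dentry;
      		op->create.mode = S_IFDIR | mode;
      		op->ops         = &afs_mkdir_operation;

      		/* ...and let the common core take the vnode I/O lock, rotate
      		 * through the fileservers, marshal the call and handle the
      		 * result. */
      		return afs_do_sync_operation(op);
      	}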
      Signed-off-by: David Howells <dhowells@redhat.com>
    • fs: move fiemap range validation into the file systems instances · cddf8a2c
      Committed by Christoph Hellwig
      Replace fiemap_check_flags with a fiemap_prep helper that also takes the
      inode and mapped range, and performs the sanity check and truncation
      previously done in fiemap_check_range.  This way the validation is inside
      the file system itself and thus properly works for the stacked overlayfs
      case as well.
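
      As a sketch of the resulting calling convention (the filesystem and its
      mapping helper below are hypothetical; fiemap_prep() is the new helper
      described above):

      	static int foofs_fiemap(struct inode *inode,
      				struct fiemap_extent_info *fieinfo,
      				u64 start, u64 len)
      	{
      		int error;

      		/* Flag and range validation/truncation now happens inside
      		 * the filesystem itself, so stacked users such as overlayfs
      		 * are covered as well. */
      		error = fiemap_prep(inode, fieinfo, start, &len,
      				    FIEMAP_FLAG_SYNC);
      		if (error)
      			return error;

      		return foofs_map_extents(inode, fieinfo, start, len);
      	}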
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • iomap: fix the iomap_fiemap prototype · 27328818
      Committed by Christoph Hellwig
      iomap_fiemap should take u64 start and len arguments, just like the
      ->fiemap prototype.
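
      For reference, the prototype after this change should look roughly like:

      	int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
      			 u64 start, u64 len, const struct iomap_ops *ops);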
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • fs: move the fiemap definitions out of fs.h · 10c5db28
      Committed by Christoph Hellwig
      No need to pull the fiemap definitions into almost every file in the
      kernel build.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • fs: mark __generic_block_fiemap static · 44ebcd06
      Committed by Christoph Hellwig
      There is no caller left outside of ioctl.c.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: mballoc: use lock for checking free blocks while retrying · 99377830
      Committed by Ritesh Harjani
      Currently, while doing block allocation, grp->bb_free may be modified if a
      discard is happening in parallel. For example, consider a case where many
      threads have preallocated a lot of blocks and another thread is trying to
      discard all of this group's PAs. It can then happen that we see the group's
      bb_free as zero and fail the allocation, even though there would be
      sufficient space if all the PAs were freed.

      So this patch adds another flag, "EXT4_MB_STRICT_CHECK", which is set if we
      are unable to allocate any blocks on the first try (since we may not have
      considered blocks about to be discarded from the PA lists). During the
      retry attempt we then use ext4_lock_group() while checking whether the
      group is good or not.
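
      A sketch of the retry check, much simplified (the surrounding control flow
      is approximate, not the literal diff):

      	if (ac->ac_flags & EXT4_MB_STRICT_CHECK) {
      		/* Retry path: bb_free may race with a parallel PA discard,
      		 * so re-check the group under its lock. */
      		ext4_lock_group(sb, group);
      		ret = ext4_mb_good_group(ac, group, cr);
      		ext4_unlock_group(sb, group);
      	} else {
      		/* First attempt: keep the cheap lockless check. */
      		ret = ext4_mb_good_group(ac, group, cr);
      	}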
      Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • writeback: Export inode_io_list_del() · 4301efa4
      Committed by Jan Kara
      Ext4 needs to remove an inode from the writeback lists after it is no
      longer visible to its journalling machinery (which can still dirty the
      inode). Export inode_io_list_del() for it.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: translate a few more map flags to strings in tracepoints · 493e83aa
      Committed by Eric Whitney
      As new ext4_map_blocks() flags have been added, not all of them have been
      given the flag-bit-to-string translations that make tracepoint output more
      readable.
      Fix that, and go one step further by adding a translation for the
      EXT4_EX_NOCACHE flag as well.  The EXT4_EX_FORCE_CACHE flag can never
      be set in a tracepoint in the current code, so there's no need to
      bother with a translation for it right now.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20200415203140.30349-3-enwlinux@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flag · 9e52484c
      Committed by Eric Whitney
      The eofblocks code was removed in the 5.7 release by "ext4: remove
      EOFBLOCKS_FL and associated code" (4337ecd1).  The ext4_map_blocks()
      flag used to trigger it can now be removed as well.
      Signed-off-by: Eric Whitney <enwlinux@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • include/linux/memblock.h: fix minor typo and unclear comment · 8cbd54f5
      Committed by chenqiwu
      Fix a minor typo ("usabe" -> "usable") in the description of the member
      variable "memory" in struct memblock.

      Also, the member variable "base" in struct memblock_type is currently
      described as the physical address of the memory region; describing it as
      the base address of the region is clearer, since the variable is already
      typed as phys_addr_t.
      Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Link: http://lkml.kernel.org/r/1588846952-32166-1-git-send-email-qiwuchen55@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: reclaim writepage is IO cost · 96f8bf4f
      Committed by Johannes Weiner
      The VM tries to balance reclaim pressure between anon and file so as to
      reduce the amount of IO incurred due to the memory shortage.  It already
      counts refaults and swapins, but in addition it should also count
      writepage calls during reclaim.
      
      For swap, this is obvious: it's IO that wouldn't have occurred if the
      anonymous memory hadn't been under memory pressure.  From a relative
      balancing point of view this makes sense as well: even if anon is cold and
      reclaimable, a cache that isn't thrashing may have equally cold pages that
      don't require IO to reclaim.
      
      For file writeback, it's trickier: some of the reclaim writepage IO would
      have likely occurred anyway due to dirty expiration.  But not all of it -
      premature writeback reduces batching and generates additional writes.
      Since the flushers are already woken up by the time the VM starts writing
      cache pages one by one, let's assume that we're likely causing writes that
      wouldn't have happened without memory pressure.  In addition, the per-page
      cost of IO would have probably been much cheaper if written in larger
      batches from the flusher thread rather than the single-page-writes from
      kswapd.
      
      For our purposes - getting the trend right to accelerate convergence on a
      stable state that doesn't require paging at all - this is sufficiently
      accurate.  If we later wanted to optimize for sustained thrashing, we can
      still refine the measurements.
      
      Count all writepage calls from kswapd as IO cost toward the LRU that the
      page belongs to.
      
      Why do this dynamically?  Don't we know in advance that anon pages require
      IO to reclaim, and so could build in a static bias?
      
      First, scanning is not the same as reclaiming.  If all the anon pages are
      referenced, we may not swap for a while just because we're scanning the
      anon list.  During this time, however, it's important that we age
      anonymous memory and the page cache at the same rate so that their
      hot-cold gradients are comparable.  Everything else being equal, we still
      want to reclaim the coldest memory overall.
      
      Second, we keep copies in swap unless the page changes.  If there is
      swap-backed data that's mostly read (tmpfs file) and has been swapped out
      before, we can reclaim it without incurring additional IO.
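
      In code terms this amounts to something like the following, much simplified
      (placement and counter name are approximate):

      	/* In shrink_inactive_list(), after shrink_page_list() has reported
      	 * how many pages reclaim had to write back itself: charge those
      	 * writes as IO cost to the LRU the pages came from. */
      	lru_note_cost(lruvec, file, stat.nr_pageout);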
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: determine anon/file pressure balance at the reclaim root · 7cf111bc
      Committed by Johannes Weiner
      We split the LRU lists into anon and file, and we rebalance the scan
      pressure between them when one of them begins thrashing: if the file cache
      experiences workingset refaults, we increase the pressure on anonymous
      pages; if the workload is stalled on swapins, we increase the pressure on
      the file cache instead.
      
      With cgroups and their nested LRU lists, we currently don't do this
      correctly.  While recursive cgroup reclaim establishes a relative LRU
      order among the pages of all involved cgroups, LRU pressure balancing is
      done on an individual cgroup LRU level.  As a result, when one cgroup is
      thrashing on the filesystem cache while a sibling may have cold anonymous
      pages, pressure doesn't get equalized between them.
      
      This patch moves LRU balancing decision to the root of reclaim - the same
      level where the LRU order is established.
      
      It does this by tracking LRU cost recursively, so that every level of the
      cgroup tree knows the aggregate LRU cost of all memory within its domain.
      When the page scanner calculates the scan balance for any given individual
      cgroup's LRU list, it uses the values from the ancestor cgroup that
      initiated the reclaim cycle.
      
      If one sibling is then thrashing on the cache, it will tip the pressure
      balance inside its ancestors, and the next hierarchical reclaim iteration
      will go more after the anon pages in the tree.
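
      A condensed sketch of the two halves of this (simplified; parent_lruvec()
      and the cost fields follow this series, but the exact code differs):

      	/* 1) Cost bubbles up the cgroup tree, so every ancestor knows the
      	 *    aggregate cost of the memory in its domain. */
      	do {
      		if (file)
      			lruvec->file_cost += nr_pages;
      		else
      			lruvec->anon_cost += nr_pages;
      	} while ((lruvec = parent_lruvec(lruvec)));

      	/* 2) The scan balance for any individual cgroup is computed from
      	 *    the lruvec of the cgroup that initiated the reclaim cycle. */
      	struct lruvec *target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
      							 pgdat);
      	unsigned long anon_cost = target_lruvec->anon_cost;
      	unsigned long file_cost = target_lruvec->file_cost;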
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: balance LRU lists based on relative thrashing · 314b57fb
      Committed by Johannes Weiner
      Since the LRUs were split into anon and file lists, the VM has been
      balancing between page cache and anonymous pages based on per-list ratios
      of scanned vs.  rotated pages.  In most cases that tips page reclaim
      towards the list that is easier to reclaim and has the fewest actively
      used pages, but there are a few problems with it:
      
      1. Refaults and LRU rotations are weighted the same way, even though
         one costs IO and the other costs a bit of CPU.
      
      2. The less we scan an LRU list based on already observed rotations,
         the more we increase the sampling interval for new references, and
         rotations become even more likely on that list. This can enter a
         death spiral in which we stop looking at one list completely until
         the other one is all but annihilated by page reclaim.
      
      Since commit a528910e ("mm: thrash detection-based file cache sizing")
      we have refault detection for the page cache.  Along with swapin events,
      they are good indicators of when the file or anon list, respectively, is
      too small for its workingset and needs to grow.
      
      For example, if the page cache is thrashing, the cache pages need more
      time in memory, while there may be colder pages on the anonymous list.
      Likewise, if swapped pages are faulting back in, it indicates that we
      reclaim anonymous pages too aggressively and should back off.
      
      Replace LRU rotations with refaults and swapins as the basis for relative
      reclaim cost of the two LRUs.  This will have the VM target list balances
      that incur the least amount of IO on aggregate.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: base LRU balancing on an explicit cost model · 1431d4d1
      Committed by Johannes Weiner
      Currently, scan pressure between the anon and file LRU lists is balanced
      based on a mixture of reclaim efficiency and a somewhat vague notion of
      "value" of having certain pages in memory over others.  That concept of
      value is problematic, because it has caused us to count any event that
      remotely makes one LRU list more or less preferable for reclaim, even
      when these events are not directly comparable and impose very different
      costs on the system.  One example is referenced file pages that we still
      deactivate and referenced anonymous pages that we actually rotate back to
      the head of the list.
      
      There is also conceptual overlap with the LRU algorithm itself.  By
      rotating recently used pages instead of reclaiming them, the algorithm
      already biases the applied scan pressure based on page value.  Thus, when
      rebalancing scan pressure due to rotations, we should think of reclaim
      cost, and leave assessing the page value to the LRU algorithm.
      
      Lastly, considering both value-increasing as well as value-decreasing
      events can sometimes cause the same type of event to be counted twice,
      i.e. rotating a page increases the LRU value, while reclaiming it
      successfully decreases the value.  In itself this will balance out fine,
      but it quietly skews the impact of events that are only recorded once.
      
      The abstract metric of "value", the murky relationship with the LRU
      algorithm, and accounting both negative and positive events make the
      current pressure balancing model hard to reason about and modify.
      
      This patch switches to a balancing model of accounting the concrete,
      actually observed cost of reclaiming one LRU over another.  For now, that
      cost includes pages that are scanned but rotated back to the list head.
      Subsequent patches will add consideration for IO caused by refaulting of
      recently evicted pages.
      
      Replace struct zone_reclaim_stat with two cost counters in the lruvec, and
      make everything that affects cost go through a new lru_note_cost()
      function.
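
      A rough sketch of the new bookkeeping (simplified; the field and helper
      names follow the description above):

      	/* Two cost counters per lruvec replace struct zone_reclaim_stat. */
      	struct lruvec {
      		/* ... existing fields ... */
      		unsigned long	anon_cost;
      		unsigned long	file_cost;
      	};

      	/* Every event that makes one LRU costlier to reclaim than the
      	 * other funnels through a single accounting helper. */
      	void lru_note_cost(struct lruvec *lruvec, bool file,
      			   unsigned int nr_pages)
      	{
      		if (file)
      			lruvec->file_cost += nr_pages;
      		else
      			lruvec->anon_cost += nr_pages;
      	}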
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>