提交 · def9b71ee651a6fee93a10734b94f93a69cdb2d4 · OpenHarmony / kernel_linux

01 2月, 2018 25 次提交

include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer · def9b71e

由 Petr Tesarik 提交于 1月 31, 2018

The comment is confusing.  On the one hand, it refers to 32-bit
alignment (struct page alignment on 32-bit platforms), but this would
only guarantee that the 2 lowest bits must be zero.  On the other hand,
it claims that at least 3 bits are available, and 3 bits are actually
used.

This is not broken, because there is a stronger alignment guarantee,
just less obvious.  Let's fix the comment to make it clear how many bits
are available and why.

Although memmap arrays are allocated in various places, the resulting
pointer is encoded eventually, so I am adding a BUG_ON() here to enforce
at runtime that all expected bits are indeed available.

I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is
sufficient, because this part of the calculation can be easily checked
at build time.

[ptesarik@suse.com: v2]
  Link: http://lkml.kernel.org/r/20180125100516.589ea6af@ezekiel.suse.cz
Link: http://lkml.kernel.org/r/20180119080908.3a662e6f@ezekiel.suse.czSigned-off-by: NPetr Tesarik <ptesarik@suse.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

def9b71e

zswap: only save zswap header when necessary · 9c3760eb

由 Yu Zhao 提交于 1月 31, 2018

We waste sizeof(swp_entry_t) for zswap header when using zsmalloc as
zpool driver because zsmalloc doesn't support eviction.

Add zpool_evictable() to detect if zpool is potentially evictable, and
use it in zswap to avoid waste memory for zswap header.

[yuzhao@google.com: The zpool->" prefix is a result of copy & paste]
  Link: http://lkml.kernel.org/r/20180110225626.110330-1-yuzhao@google.com
Link: http://lkml.kernel.org/r/20180110224741.83751-1-yuzhao@google.comSigned-off-by: NYu Zhao <yuzhao@google.com>
Acked-by: NDan Streetman <ddstreet@ieee.org>
Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9c3760eb

hugetlb: implement memfd sealing · ff62a342

由 Marc-André Lureau 提交于 1月 31, 2018

Implements memfd sealing, similar to shmem:
 - WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
   memfd_add_seals(). write() doesn't exist for hugetlbfs.
 - SHRINK: added similar check as shmem_setattr()
 - GROW: added similar check as shmem_setattr() & shmem_fallocate()

Except write() operation that doesn't exist with hugetlbfs, that should
make sealing as close as it can be to shmem support.

Link: http://lkml.kernel.org/r/20171107122800.25517-5-marcandre.lureau@redhat.comSigned-off-by: NMarc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ff62a342

hugetlb: expose hugetlbfs_inode_info in header · da14c1e5

由 Marc-André Lureau 提交于 1月 31, 2018

hugetlbfs inode information will need to be accessed by code in
mm/shmem.c for file sealing operations.  Move inode information
definition from .c file to header for needed access.

Link: http://lkml.kernel.org/r/20171107122800.25517-4-marcandre.lureau@redhat.comSigned-off-by: NMarc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

da14c1e5

shmem: rename functions that are memfd-related · 5aadc431

由 Marc-André Lureau 提交于 1月 31, 2018

Those functions are called for memfd files, backed by shmem or hugetlb
(the next patches will handle hugetlb).

Link: http://lkml.kernel.org/r/20171107122800.25517-3-marcandre.lureau@redhat.comSigned-off-by: NMarc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5aadc431

shmem: unexport shmem_add_seals()/shmem_get_seals() · e9d586a8

由 Marc-André Lureau 提交于 1月 31, 2018

Patch series "memfd: add sealing to hugetlb-backed memory", v3.

Recently, Mike Kravetz added hugetlbfs support to memfd.  However, he
didn't add sealing support.  One of the reasons to use memfd is to have
shared memory sealing when doing IPC or sharing memory with another
process with some extra safety.  qemu uses shared memory & hugetables
with vhost-user (used by dpdk), so it is reasonable to use memfd now
instead for convenience and security reasons.

This patch (of 9):

The functions are called through shmem_fcntl() only.  And no danger in
removing the EXPORTs as the routines only work with shmem file structs.

Link: http://lkml.kernel.org/r/20171107122800.25517-2-marcandre.lureau@redhat.comSigned-off-by: NMarc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e9d586a8

mm: remove reference to PG_buddy · ab8928b7

由 Matthew Wilcox 提交于 1月 31, 2018

PG_buddy doesn't exist any more. It's called PageBuddy now.

Link: http://lkml.kernel.org/r/20171220155552.15884-9-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ab8928b7

mm: document how to use struct page · be50015d

由 Matthew Wilcox 提交于 1月 31, 2018

Be really explicit about what bits / bytes are reserved for users that
want to store extra information about the pages they allocate.

Link: http://lkml.kernel.org/r/20171220155552.15884-8-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: NRandy Dunlap <rdunlap@infradead.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

be50015d

mm: store compound_dtor / compound_order as bytes · 036e7aa4

由 Matthew Wilcox 提交于 1月 31, 2018

Neither of these values get even close to 256; compound_dtor is
currently at a maximum of 3, and compound_order can't be over 64.  No
machine has inefficient access to bytes since EV5, and while those are
still supported, we don't optimise for them any more.  This does not
shrink struct page, but it removes an ifdef and frees up 2-6 bytes for
future use.

diff of pahole output:

 		struct callback_head callback_head;      /*    32    16 */
 		struct {
 			long unsigned int compound_head; /*    32     8 */
-			unsigned int compound_dtor;      /*    40     4 */
-			unsigned int compound_order;     /*    44     4 */
+			unsigned char compound_dtor;     /*    40     1 */
+			unsigned char compound_order;    /*    41     1 */
 		};                                       /*    32    16 */
 	};                                               /*    32    16 */
 	union {

[mawilcox@microsoft.com: add comment]
  Link: http://lkml.kernel.org/r/20171221000144.GB2980@bombadil.infradead.org
Link: http://lkml.kernel.org/r/20171220155552.15884-7-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

036e7aa4

mm: introduce _slub_counter_t · 0dd4da5b

由 Matthew Wilcox 提交于 1月 31, 2018

Instead of putting the ifdef in the middle of the definition of struct
page, pull it forward to the rest of the ifdeffery around the SLUB
cmpxchg_double optimisation.

Link: http://lkml.kernel.org/r/20171220155552.15884-6-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0dd4da5b

mm: improve comment on page->mapping · b26435a0

由 Matthew Wilcox 提交于 1月 31, 2018

The comment on page->mapping is terse, and out of date (it does not
mention the possibility of PAGE_MAPPING_MOVABLE). Instead, point the
interested reader to page-flags.h where there is a much better comment.

Link: http://lkml.kernel.org/r/20171220155552.15884-5-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b26435a0

mm: remove misleading alignment claims · 4cf7c8bf

由 Matthew Wilcox 提交于 1月 31, 2018

The "third double word block" isn't on 32-bit systems.  The layout looks
like this:

	unsigned long flags;
	struct address_space *mapping
	pgoff_t index;
	atomic_t _mapcount;
	atomic_t _refcount;

which is 32 bytes on 64-bit, but 20 bytes on 32-bit.  Nobody is trying to
use the fact that it's double-word aligned today, so just remove the
misleading claims.

Link: http://lkml.kernel.org/r/20171220155552.15884-4-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4cf7c8bf

mm: de-indent struct page · ca9c88c7

由 Matthew Wilcox 提交于 1月 31, 2018

I found the struct { union { struct { union { struct { } } } } } layout
rather confusing.  Fortunately, there is an easier way to write this.

The innermost union is of four things which are the size of an int, so
the ones which are used by slab/slob/slub can be pulled up two levels to
be in the outermost union with 'counters'.  That leaves us with struct {
union { struct { atomic_t; atomic_t; } } } which has the same layout,
but is easier to read.

Output from the current git version of pahole, diffed with -uw to ignore
the whitespace changes from the indentation:

 	};						/*    16     8 */
 	union {
 		long unsigned int  counters;		/*    24     8 */
-		struct {
-			union {
-				atomic_t _mapcount;	/*    24     4 */
 				unsigned int active;	/*    24     4 */
 				struct {
 					unsigned int inuse:16; /*    24:16  4 */
@@ -21,7 +18,8 @@
 					unsigned int frozen:1; /*    24: 0  4 */
 				};			/*    24     4 */
 				int units;		/*    24     4 */
-			};				/*    24     4 */
+		struct {
+			atomic_t   _mapcount;		/*    24     4 */
 			atomic_t   _refcount;		/*    28     4 */
 		};					/*    24     8 */
 	};						/*    24     8 */

Link: http://lkml.kernel.org/r/20171220155552.15884-3-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ca9c88c7

mm: align struct page more aesthetically · e20df2c6

由 Matthew Wilcox 提交于 1月 31, 2018

Patch series "Restructure struct page", v2.

This series does not attempt any grand restructuring.  Instead, it cures
the worst of the indentitis, fixes the documentation and reduces the
ifdeffery.  The only layout change is compound_dtor and compound_order
are each reduced to one byte.

This patch (of 8):

Instead of an ifdef block at the end of the struct, which needed its own
comment, define _struct_page_alignment up at the top where it fits
nicely with the existing comment.

Link: http://lkml.kernel.org/r/20171220155552.15884-2-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NChristoph Lameter <cl@linux.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e20df2c6

mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks · 5ff7091f

由 David Rientjes 提交于 1月 31, 2018

Commit 4d4bbd85 ("mm, oom_reaper: skip mm structs with mmu
notifiers") prevented the oom reaper from unmapping private anonymous
memory with the oom reaper when the oom victim mm had mmu notifiers
registered.

The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
around the unmap_page_range(), which is needed, can block and the oom
killer will stall forever waiting for the victim to exit, which may not
be possible without reaping.

That concern is real, but only true for mmu notifiers that have
blockable invalidate_range_{start,end}() callbacks.  This patch adds a
"flags" field to mmu notifier ops that can set a bit to indicate that
these callbacks do not block.

The implementation is steered toward an expensive slowpath, such as
after the oom reaper has grabbed mm->mmap_sem of a still alive oom
victim.

[rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
[akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Acked-by: NDimitri Sivanich <sivanich@hpe.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ff7091f

mm: get 7% more pages in a pagevec · 146500e9

由 Matthew Wilcox 提交于 1月 31, 2018

We don't have to use an entire 'long' for the number of elements in the
pagevec; we know it's a number between 0 and 14 (now 15). So we can
store it in a char, and then the bool packs next to it and we still have
two or six bytes of padding for more elements in the header. That gives
us space to cram in an extra page.

Link: http://lkml.kernel.org/r/20171206022521.GM26021@bombadil.infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

146500e9

mm: add unmap_mapping_pages() · 977fbdcd

由 Matthew Wilcox 提交于 1月 31, 2018

Several users of unmap_mapping_range() would prefer to express their
range in pages rather than bytes. Unfortuately, on a 32-bit kernel, you
have to remember to cast your page number to a 64-bit type before
shifting it, and four places in the current tree didn't remember to do
that. That's a sign of a bad interface.

Conveniently, unmap_mapping_range() actually converts from bytes into
pages, so hoist the guts of unmap_mapping_range() into a new function
unmap_mapping_pages() and convert the callers which want to use pages.

Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Reported-by: N"zhangyi (F)" <yi.zhang@huawei.com>
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

977fbdcd

mm, hugetlb: remove hugepages_treat_as_movable sysctl · d6cb41cc

由 Michal Hocko 提交于 1月 31, 2018

hugepages_treat_as_movable has been introduced by 396faf03 ("Allow
huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb
allocations from ZONE_MOVABLE even when hugetlb pages were not
migrateable.  The purpose of the movable zone was different at the time.
It aimed at reducing memory fragmentation and hugetlb pages being long
lived and large werre not contributing to the fragmentation so it was
acceptable to use the zone back then.

Things have changed though and the primary purpose of the zone became
migratability guarantee.  If we allow non migrateable hugetlb pages to
be in ZONE_MOVABLE memory hotplug might fail to offline the memory.

Remove the knob and only rely on hugepage_migration_supported to allow
movable zones.

Mel said:

: Primarily it was aimed at allowing the hugetlb pool to safely shrink with
: the ability to grow it again.  The use case was for batched jobs, some of
: which needed huge pages and others that did not but didn't want the memory
: useless pinned in the huge pages pool.
:
: I suspect that more users rely on THP than hugetlbfs for flexible use of
: huge pages with fallback options so I think that removing the option
: should be ok.

Link: http://lkml.kernel.org/r/20171003072619.8654-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Reported-by: NAlexandru Moise <00moses.alexander00@gmail.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Cc: Alexandru Moise <00moses.alexander00@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d6cb41cc

mm: remove unused pgdat_reclaimable_pages() · a4ef8768

由 Jan Kara 提交于 1月 31, 2018

Remove unused function pgdat_reclaimable_pages() and
node_page_state_snapshot() which becomes unused as well.

Link: http://lkml.kernel.org/r/20171122094416.26019-1-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a4ef8768

mm: memcontrol: fix excessive complexity in memory.stat reporting · a983b5eb

由 Johannes Weiner 提交于 1月 31, 2018

We've seen memory.stat reads in top-level cgroups take up to fourteen
seconds during a userspace bug that created tens of thousands of ghost
cgroups pinned by lingering page cache.

Even with a more reasonable number of cgroups, aggregating memory.stat
is unnecessarily heavy. The complexity is this:

nr_cgroups * nr_stat_items * nr_possible_cpus

where the stat items are ~70 at this point. With 128 cgroups and 128
CPUs - decent, not enormous setups - reading the top-level memory.stat
has to aggregate over a million per-cpu counters. This doesn't scale.

Instead of spreading the source of truth across all CPUs, use the
per-cpu counters merely to batch updates to shared atomic counters.

This is the same as the per-cpu stocks we use for charging memory to the
shared atomic page_counters, and also the way the global vmstat counters
are implemented.

Vmstat has elaborate spilling thresholds that depend on the number of
CPUs, amount of memory, and memory pressure - carefully balancing the
cost of counter updates with the amount of per-cpu error. That's
because the vmstat counters are system-wide, but also used for decisions
inside the kernel (e.g. NR_FREE_PAGES in the allocator). Neither is
true for the memory controller.

Use the same static batch size we already use for page_counter updates
during charging. The per-cpu error in the stats will be 128k, which is
an acceptable ratio of cores to memory accounting granularity.

[hannes@cmpxchg.org: fix warning in __this_cpu_xchg() calls]
Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org
Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a983b5eb

mm: memcontrol: implement lruvec stat functions on top of each other · 28454265

由 Johannes Weiner 提交于 1月 31, 2018

The implementation of the lruvec stat functions and their variants for
accounting through a page, or accounting from a preemptible context, are
mostly identical and needlessly repetitive.

Implement the lruvec_page functions by looking up the page's lruvec and
then using the lruvec function.

Implement the functions for preemptible contexts by disabling preemption
before calling the atomic context functions.

Link: http://lkml.kernel.org/r/20171103153336.24044-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

28454265

mm: memcontrol: eliminate raw access to stat and event counters · c9019e9b

由 Johannes Weiner 提交于 1月 31, 2018

Replace all raw 'this_cpu_' modifications of the stat and event per-cpu
counters with API functions such as mod_memcg_state().

This makes the code easier to read, but is also in preparation for the
next patch, which changes the per-cpu implementation of those counters.

Link: http://lkml.kernel.org/r/20171103153336.24044-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c9019e9b

mm: drop hotplug lock from lru_add_drain_all() · 9852a721

由 Michal Hocko 提交于 1月 31, 2018

Pulling cpu hotplug locks inside the mm core function like
lru_add_drain_all just asks for problems and the recent lockdep splat
[1] just proves this.  While the usage in that particular case might be
wrong we should avoid the locking as lru_add_drain_all() is used in many
places.  It seems that this is not all that hard to achieve actually.

We have done the same thing for drain_all_pages which is analogous by
commit a459eeb7 ("mm, page_alloc: do not depend on cpu hotplug locks
inside the allocator").  All we have to care about is to handle

      - the work item might be executed on a different cpu in worker from
        unbound pool so it doesn't run on pinned on the cpu

      - we have to make sure that we do not race with page_alloc_cpu_dead
        calling lru_add_drain_cpu

the first part is already handled because the worker calls lru_add_drain
which disables preemption when calling lru_add_drain_cpu on the local
cpu it is draining.  The later is true because page_alloc_cpu_dead is
called on the controlling CPU after the hotplugged CPU vanished
completely.

[1] http://lkml.kernel.org/r/089e0825eec8955c1f055c83d476@google.com

[add a cpu hotplug locking interaction as per tglx]
Link: http://lkml.kernel.org/r/20171116120535.23765-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9852a721

include/linux/sched/mm.h: uninline mmdrop_async(), etc · d70f2a14

由 Andrew Morton 提交于 1月 31, 2018

mmdrop_async() is only used in fork.c.  Move that and its support
functions into fork.c, uninline it all.

Quite a lot of code gets moved around to avoid forward declarations.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d70f2a14

iversion: make inode_cmp_iversion{+raw} return bool instead of s64 · c0cef30e

由 Jeff Layton 提交于 1月 30, 2018

As Linus points out:

    The inode_cmp_iversion{+raw}() functions are pure and utter crap.

    Why?

    You say that they return 0/negative/positive, but they do so in a
    completely broken manner. They return that ternary value as the
    sequence number difference in a 's64', which means that if you
    actually care about that ternary value, and do the *sane* thing that
    the kernel-doc of the function implies is the right thing, you would
    do

        int cmp = inode_cmp_iversion(inode, old);
        if (cmp < 0 ...

    and as a result you get code that looks sane, but that doesn't
    actually *WORK* right.

Since none of the callers actually care about the ternary value here,
convert the inode_cmp_iversion{+raw} functions to just return a boolean
value (false for matching, true for non-matching).

This matches the existing use of these functions just fine, and makes it
simple to convert them to return a ternary value in the future if we
grow callers that need it.

With this change we can also reimplement inode_cmp_iversion in a simpler
way using inode_peek_iversion.
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c0cef30e

30 1月, 2018 1 次提交

dm mpath: delay the retry of a request if the target responded as busy · ac514ffc

由 Mike Snitzer 提交于 1月 12, 2018

Add DM_ENDIO_DELAY_REQUEUE to allow request-based multipath's
multipath_end_io() to instruct dm-rq.c:dm_done() to delay a requeue.
This is beneficial to do if BLK_STS_RESOURCE is returned from the target
(because target is busy).

Relative to blk-mq: kick the hw queues via blk_mq_requeue_work(),
indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay.

For old .request_fn: use blk_delay_queue().

bio-based multipath doesn't have feature parity with request-based for
retryable error requeues; that is something that'll need fixing in the
future.
Suggested-by: NBart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Acked-by: NBart Van Assche <bart.vanassche@wdc.com>
[as interpreted from Bart's "... patch looks fine to me."]

ac514ffc

29 1月, 2018 4 次提交

xfs: only grab shared inode locks for source file during reflink · 01c2e13d

由 Darrick J. Wong 提交于 1月 18, 2018

Reflink and dedupe operations remap blocks from a source file into a
destination file. The destination file needs exclusive locks on all
levels because we're updating its block map, but the source file isn't
undergoing any block map changes so we can use a shared lock.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

01c2e13d

fs: handle inode->i_version more efficiently · f02a9ad1

由 Jeff Layton 提交于 12月 21, 2017

Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.

Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.

When we go to maybe increment it, we fetch the value and check the flag
bit.  If it's clear then we don't need to do anything if the update
isn't being forced.

If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.

On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.

This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Acked-by: NDave Chinner <dchinner@redhat.com>
Tested-by: NKrzysztof Kozlowski <krzk@kernel.org>

f02a9ad1

fs: don't take the i_lock in inode_inc_iversion · 7594c461

由 Jeff Layton 提交于 12月 18, 2017

The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least
not universally), so my assumption is that it was only done here to
serialize incrementors.

If that is indeed the case, then we can drop the i_lock from this
codepath and treat it as a atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without
any danger of lock inversion.

Note that the read side is not fetched atomically with this change.
The assumption here is that that is not a critical issue since the
i_version is not fully synchronized with anything else anyway.
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>

7594c461

fs: new API for handling inode->i_version · ae5e165d

由 Jeff Layton 提交于 1月 29, 2018

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>

ae5e165d

26 1月, 2018 8 次提交

regulator: add PM suspend and resume hooks · f7efad10

由 Chunyan Zhang 提交于 1月 26, 2018

In this patch, consumers are allowed to set suspend voltage, and this
actually just set the "uV" in constraint::regulator_state, when the
regulator_suspend_late() was called by PM core through callback when
the system is entering into suspend, the regulator device would act
suspend activity then.

And it assumes that if any consumer set suspend voltage, the regulator
device should be enabled in the suspend state.  And if the suspend
voltage of a regulator device for all consumers was set zero, the
regulator device would be off in the suspend state.

This patch also provides a new function hook to regulator devices for
resuming from suspend states.
Signed-off-by: NChunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: NMark Brown <broonie@kernel.org>

f7efad10

regulator: empty the old suspend functions · aa27bbc6

由 Chunyan Zhang 提交于 1月 26, 2018

Regualtor suspend/resume functions should only be called by PM suspend
core via registering dev_pm_ops, and regulator devices should implement
the callback functions.  Thus, any regulator consumer shouldn't call
the regulator suspend/resume functions directly.

In order to avoid compile errors, two empty functions with the same name
still be left for the time being.
Signed-off-by: NChunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: NMark Brown <broonie@kernel.org>

aa27bbc6

regulator: leave one item to record whether regulator is enabled · 72069f99

由 Chunyan Zhang 提交于 1月 26, 2018

The items "disabled" and "enabled" are a little redundant, since only one
of them would be set to record if the regulator device should keep on
or be switched to off in suspend states.

So in this patch, the "disabled" was removed, only leave the "enabled":
  - enabled == 1 for regulator-on-in-suspend
  - enabled == 0 for regulator-off-in-suspend
  - enabled == -1 means do nothing when entering suspend mode.
Signed-off-by: NChunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: NMark Brown <broonie@kernel.org>

72069f99

module/retpoline: Warn about missing retpoline in module · caf7501a

由 Andi Kleen 提交于 1月 25, 2018

There's a risk that a kernel which has full retpoline mitigations becomes
vulnerable when a module gets loaded that hasn't been compiled with the
right compiler or the right option.

To enable detection of that mismatch at module load time, add a module info
string "retpoline" at build time when the module was compiled with
retpoline support. This only covers compiled C source, but assembler source
or prebuilt object files are not checked.

If a retpoline enabled kernel detects a non retpoline protected module at
load time, print a warning and report it in the sysfs vulnerability file.

[ tglx: Massaged changelog ]
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: gregkh@linuxfoundation.org
Cc: torvalds@linux-foundation.org
Cc: jeyu@kernel.org
Cc: arjan@linux.intel.com
Link: https://lkml.kernel.org/r/20180125235028.31211-1-andi@firstfloor.org

caf7501a

fs/buffer.c: fold init_buffer() into init_page_buffers() · 01950a34

由 Eric Biggers 提交于 1月 16, 2018

Since commit e7600409 ("fs/buffer.c: remove unnecessary init
operation after allocating buffer_head"), there are no callers of
init_buffer() outside of init_page_buffers().  So just fold it into
init_page_buffers().
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

01950a34

fs: fold __inode_permission() into inode_permission() · 4bfd054a

由 Eric Biggers 提交于 1月 16, 2018

Since commit 9c630ebe ("ovl: simplify permission checking"),
overlayfs doesn't call __inode_permission() anymore, which leaves no
users other than inode_permission().  So just fold it back into
inode_permission().
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4bfd054a

fs: add RWF_APPEND · e1fc742e

由 Jürg Billeter 提交于 9月 29, 2017

This is the per-I/O equivalent of O_APPEND to support atomic append
operations on any open file.

If a file is opened with O_APPEND, pwrite() ignores the offset and
always appends data to the end of the file. RWF_APPEND enables atomic
append and pwrite() with offset on a single file descriptor.
Signed-off-by: NJürg Billeter <j@bitron.ch>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e1fc742e

f2fs: support inode creation time · 1c1d35df

由 Chao Yu 提交于 1月 25, 2018

This patch adds creation time field in inode layout to support showing
kstat.btime in ->statx.
Signed-off-by: NChao Yu <yuchao0@huawei.com>
Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>

1c1d35df

25 1月, 2018 1 次提交

Revert "module: Add retpoline tag to VERMAGIC" · 5132ede0

由 Greg Kroah-Hartman 提交于 1月 24, 2018

This reverts commit 6cfb521a.

Turns out distros do not want to make retpoline as part of their "ABI",
so this patch should not have been merged.  Sorry Andi, this was my
fault, I suggested it when your original patch was the "correct" way of
doing this instead.
Reported-by: NJiri Kosina <jikos@kernel.org>
Fixes: 6cfb521a ("module: Add retpoline tag to VERMAGIC")
Acked-by: NAndi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: rusty@rustcorp.com.au
Cc: arjan.van.de.ven@intel.com
Cc: jeyu@kernel.org
Cc: stable <stable@vger.kernel.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5132ede0

24 1月, 2018 1 次提交

sched/core: Fix cpu.max vs. cpuhotplug deadlock · ce48c146

由 Peter Zijlstra 提交于 1月 22, 2018

Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

  tg_set_cfs_bandwidth()
    get_online_cpus()
      cpus_read_lock()

    cfs_bandwidth_usage_inc()
      static_key_slow_inc()
        cpus_read_lock()
Reported-by: NTejun Heo <tj@kernel.org>
Tested-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktopSigned-off-by: NIngo Molnar <mingo@kernel.org>

ce48c146

OpenHarmony / kernel_linux 上一次同步 4 年多

OpenHarmony / kernel_linux
上一次同步 4 年多