提交 · a8d0143730d7b42c9fe6d1435d92ecce6863a62a · openeuler / raspberrypi-kernel

15 1月, 2016 24 次提交

mm: page_alloc: generalize the dirty balance reserve · a8d01437

由 Johannes Weiner 提交于 1月 14, 2016

The dirty balance reserve that dirty throttling has to consider is
merely memory not available to userspace allocations.  There is nothing
writeback-specific about it.  Generalize the name so that it's reusable
outside of that context.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a8d01437

mm: allow GFP_{FS,IO} for page_cache_read page cache allocation · c20cd45e

由 Michal Hocko 提交于 1月 14, 2016

page_cache_read has been historically using page_cache_alloc_cold to
allocate a new page.  This means that mapping_gfp_mask is used as the
base for the gfp_mask.  Many filesystems are setting this mask to
GFP_NOFS to prevent from fs recursion issues.  page_cache_read is called
from the vm_operations_struct::fault() context during the page fault.
This context doesn't need the reclaim protection normally.

ceph and ocfs2 which call filemap_fault from their fault handlers seem
to be OK because they are not taking any fs lock before invoking generic
implementation.  xfs which takes XFS_MMAPLOCK_SHARED is safe from the
reclaim recursion POV because this lock serializes truncate and punch
hole with the page faults and it doesn't get involved in the reclaim.

There is simply no reason to deliberately use a weaker allocation
context when a __GFP_FS | __GFP_IO can be used.  The GFP_NOFS protection
might be even harmful.  There is a push to fail GFP_NOFS allocations
rather than loop within allocator indefinitely with a very limited
reclaim ability.  Once we start failing those requests the OOM killer
might be triggered prematurely because the page cache allocation failure
is propagated up the page fault path and end up in
pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy
wrt.  parallel page faults and it might interfere with other users who
really rely on NOFS semantic from the stored gfp_mask.  The mask is also
inode proper so it would even be a layering violation.  What we can do
instead is to push the gfp_mask into struct vm_fault and allow fs layer
to overwrite it should the callback need to be called with a different
allocation context.

Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
because this should be safe from the page fault path normally.  Why do
we care about mapping_gfp_mask at all then? Because this doesn't hold
only reclaim protection flags but it also might contain zone and
movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have
to respect those.
Signed-off-by: NMichal Hocko <mhocko@suse.com>
Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: NJan Kara <jack@suse.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c20cd45e

mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259

由 Daniel Cashman 提交于 1月 14, 2016

Address Space Layout Randomization (ASLR) provides a barrier to
exploitation of user-space processes in the presence of security
vulnerabilities by making it more difficult to find desired code/data
which could help an attack.  This is done by adding a random offset to
the location of regions in the process address space, with a greater
range of potential offset values corresponding to better protection/a
larger search-space for brute force, but also to greater potential for
fragmentation.

The offset added to the mmap_base address, which provides the basis for
the majority of the mappings for a process, is set once on process exec
in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
which reflect, hopefully, the best compromise for all systems.  The
trade-off between increased entropy in the offset value generation and
the corresponding increased variability in address space fragmentation
is not absolute, however, and some platforms may tolerate higher amounts
of entropy.  This patch introduces both new Kconfig values and a sysctl
interface which may be used to change the amount of entropy used for
offset generation on a system.

The direct motivation for this change was in response to the
libstagefright vulnerabilities that affected Android, specifically to
information provided by Google's project zero at:

  http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html

The attack presented therein, by Google's project zero, specifically
targeted the limited randomness used to generate the offset added to the
mmap_base address in order to craft a brute-force-based attack.
Concretely, the attack was against the mediaserver process, which was
limited to respawning every 5 seconds, on an arm device.  The hard-coded
8 bits used resulted in an average expected success rate of defeating
the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
piece).  With this patch, and an accompanying increase in the entropy
value to 16 bits, the same attack would take an average expected time of
over 45 hours (32768 tries), which makes it both less feasible and more
likely to be noticed.

The introduced Kconfig and sysctl options are limited by per-arch
minimum and maximum values, the minimum of which was chosen to match the
current hard-coded value and the maximum of which was chosen so as to
give the greatest flexibility without generating an invalid mmap_base
address, generally a 3-4 bits less than the number of bits in the
user-space accessible virtual address space.

When decided whether or not to change the default value, a system
developer should consider that mmap_base address could be placed
anywhere up to 2^(value) bits away from the non-randomized location,
which would introduce variable-sized areas above and below the mmap_base
address such that the maximum vm_area_struct size may be reduced,
preventing very large allocations.

This patch (of 4):

ASLR only uses as few as 8 bits to generate the random offset for the
mmap base address on 32 bit architectures.  This value was chosen to
prevent a poorly chosen value from dividing the address space in such a
way as to prevent large allocations.  This may not be an issue on all
platforms.  Allow the specification of a minimum number of bits so that
platforms desiring greater ASLR protection may determine where to place
the trade-off.
Signed-off-by: NDaniel Cashman <dcashman@google.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: NKees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Nick Kralevich <nnk@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Borislav Petkov <bp@suse.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d07e2259

memcg: do not allow to disable tcp accounting after limit is set · 9ee11ba4

由 Vladimir Davydov 提交于 1月 14, 2016

There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
is never cleared while the latter can be cleared by unsetting the limit.
This allows to disable tcp socket accounting for new sockets after it
was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still
guaranteeing that memcg_socket_limit_enabled static key will be
decremented on memcg destruction.

This functionality looks dubious, because it is not clear what a use
case would be. By enabling tcp accounting a user accepts the price. If
they then find the performance degradation unacceptable, they can always
restart their workload with tcp accounting disabled. It does not seem
there is any need to flip it while the workload is running.

Besides, it contradicts to how kmem accounting API works: writing
whatever to memory.kmem.limit_in_bytes enables kmem accounting for the
cgroup in question, after which it cannot be disabled. Therefore one
might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
enables socket accounting w/o limiting it, which might be useful by
itself, but it isn't true.

Since this API peculiarity is not documented anywhere, I propose to drop
it. This will allow to simplify the code by dropping cg_proto->flags.
Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9ee11ba4

mm, vmalloc: remove VM_VPAGES · 244d63ee

由 David Rientjes 提交于 1月 14, 2016

VM_VPAGES is unnecessary, it's easier to check is_vmalloc_addr() when
reading /proc/vmallocinfo.

[akpm@linux-foundation.org: remove VM_VPAGES reference via kvfree()]
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

244d63ee

mm, shmem: add internal shmem resident memory accounting · eca56ff9

由 Jerome Marchand 提交于 1月 14, 2016

Currently looking at /proc/<pid>/status or statm, there is no way to
distinguish shmem pages from pages mapped to a regular file (shmem pages
are mapped to /dev/zero), even though their implication in actual memory
use is quite different.

The internal accounting currently counts shmem pages together with
regular files.  As a preparation to extend the userspace interfaces,
this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
shmem pages separately from MM_FILEPAGES.  The next patch will expose it
to userspace - this patch doesn't change the exported values yet, by
adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
used before.  The only user-visible change after this patch is the OOM
killer message that separates the reported "shmem-rss" from "file-rss".

[vbabka@suse.cz: forward-porting, tweak changelog]
Signed-off-by: NJerome Marchand <jmarchan@redhat.com>
Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eca56ff9

mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings · 48131e03

由 Vlastimil Babka 提交于 1月 14, 2016

Following the previous patch, further reduction of /proc/pid/smaps cost
is possible for private writable shmem mappings with unpopulated areas
where the page walk invokes the .pte_hole function.  We can use radix
tree iterator for each such area instead of calling find_get_entry() in
a loop.  This is possible at the extra maintenance cost of introducing
another shmem function shmem_partial_swap_usage().

To demonstrate the diference, I have measured this on a process that
creates a private writable 2GB mapping of a partially swapped out
/dev/shm/file (which cannot employ the optimizations from the prvious
patch) and doesn't populate it at all.  I time how long does it take to
cat /proc/pid/smaps of this process 100 times.

Before this patch:

real    0m3.831s
user    0m0.180s
sys     0m3.212s

After this patch:

real    0m1.176s
user    0m0.180s
sys     0m0.684s

The time is similar to the case where a radix tree iterator is employed
on the whole mapping.
Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

48131e03

mm, proc: reduce cost of /proc/pid/smaps for shmem mappings · 6a15a370

由 Vlastimil Babka 提交于 1月 14, 2016

The previous patch has improved swap accounting for shmem mapping, which
however made /proc/pid/smaps more expensive for shmem mappings, as we
consult the radix tree for each pte_none entry, so the overal complexity
is O(n*log(n)).

We can reduce this significantly for mappings that cannot contain COWed
pages, because then we can either use the statistics tha shmem object
itself tracks (if the mapping contains the whole object, or the swap
usage of the whole object is zero), or use the radix tree iterator,
which is much more effective than repeated find_get_entry() calls.

This patch therefore introduces a function shmem_swap_usage(vma) and
makes /proc/pid/smaps use it when possible.  Only for writable private
mappings of shmem objects (i.e.  tmpfs files) with the shmem object
itself (partially) swapped outwe have to resort to the find_get_entry()
approach.

Hopefully such mappings are relatively uncommon.

To demonstrate the diference, I have measured this on a process that
creates a 2GB mapping and dirties single pages with a stride of 2MB, and
time how long does it take to cat /proc/pid/smaps of this process 100
times.

Private writable mapping of a /dev/shm/file (the most complex case):

real    0m3.831s
user    0m0.180s
sys     0m3.212s

Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
(which needs to employ the radix tree iterator).

real    0m1.351s
user    0m0.096s
sys     0m0.768s

Same, but with /dev/shm/file not swapped (so no radix tree walk needed)

real    0m0.935s
user    0m0.128s
sys     0m0.344s

Private anonymous mapping:

real    0m0.949s
user    0m0.116s
sys     0m0.348s

The cost is now much closer to the private anonymous mapping case, unless
the shmem mapping is private and writable.
Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6a15a370

mm/mmzone.c: memmap_valid_within() can be boolean · 5b80287a

由 Yaowei Bai 提交于 1月 14, 2016

Make memmap_valid_within return bool due to this particular function
only using either one or zero as its return value.

No functional change.
Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5b80287a

mm/zonelist: enumerate zonelists array index · c00eb15a

由 Yaowei Bai 提交于 1月 14, 2016

Hardcoding index to zonelists array in gfp_zonelist() is not a good
idea, let's enumerate it to improve readability.

No functional change.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
[n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator]
Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c00eb15a

include/linux/mmzone.h: remove unused is_unevictable_lru() · 06640290

由 Yaowei Bai 提交于 1月 14, 2016

Since commit a0b8cab3 ("mm: remove lru parameter from
__pagevec_lru_add and remove parts of pagevec API") there's no
user of this function anymore, so remove it.
Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

06640290

mm/memblock.c: memblock_is_memory()/reserved() can be boolean · b4ad0c7e

由 Yaowei Bai 提交于 1月 14, 2016

Make memblock_is_memory() and memblock_is_reserved return bool to
improve readability due to these particular functions only using either
one or zero as their return value.

No functional change.
Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b4ad0c7e

include/linux/hugetlb.h: is_file_hugepages() can be boolean · 719ff321

由 Yaowei Bai 提交于 1月 14, 2016

Make is_file_hugepages() return bool to improve readability due to this
particular function only using either one or zero as its return value.

This patch also removed the if condition to make is_file_hugepages
return directly.

No functional change.
Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

719ff321

mm: change mm_vmscan_lru_shrink_inactive() proto types · ba5e9579

由 yalin wang 提交于 1月 14, 2016

Move node_id zone_idx shrink flags into trace function, so thay we don't
need caculate these args if the trace is disabled, and will make this
function have less arguments.
Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ba5e9579

mm/page_isolation.c: add new tracepoint, test_pages_isolated · 0f0848e5

由 Joonsoo Kim 提交于 1月 14, 2016

cma allocation should be guranteeded to succeed.  But sometimes it can
fail in the current implementation.  To track down the problem, we need
to know which page is problematic and this new tracepoint will report
it.
Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: NMichal Nazarewicz <mina86@mina86.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0f0848e5

mm/mempolicy.c: convert the shared_policy lock to a rwlock · 4a8c7bb5

由 Nathan Zimmer 提交于 1月 14, 2016

When running the SPECint_rate gcc on some very large boxes it was
noticed that the system was spending lots of time in
mpol_shared_policy_lookup().  The gamess benchmark can also show it and
is what I mostly used to chase down the issue since the setup for that I
found to be easier.

To be clear the binaries were on tmpfs because of disk I/O requirements.
We then used text replication to avoid icache misses and having all the
copies banging on the memory where the instruction code resides.  This
results in us hitting a bottleneck in mpol_shared_policy_lookup() since
lookup is serialised by the shared_policy lock.

I have only reproduced this on very large (3k+ cores) boxes.  The
problem starts showing up at just a few hundred ranks getting worse
until it threatens to livelock once it gets large enough.  For example
on the gamess benchmark at 128 ranks this area consumes only ~1% of
time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
90%.

To alleviate the contention in this area I converted the spinlock to an
rwlock.  This allows a large number of lookups to happen simultaneously.
The results were quite good reducing this consumtion at max ranks to
around 2%.

[akpm@linux-foundation.org: tidy up code comments]
Signed-off-by: NNathan Zimmer <nzimmer@sgi.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4a8c7bb5

mm: add PHYS_PFN, use it in __phys_to_pfn() · 8f235d1a

由 Chen Gang 提交于 1月 14, 2016

__phys_to_pfn and __pfn_to_phys are symmetric, PHYS_PFN and PFN_PHYS are
semmetric:

 - y = (phys_addr_t)x << PAGE_SHIFT

 - y >> PAGE_SHIFT = (phys_add_t)x

 - (unsigned long)(y >> PAGE_SHIFT) = x

[akpm@linux-foundation.org: use macro arg name `x']
[arnd@arndb.de: include linux/pfn.h for PHYS_PFN definition]
Signed-off-by: NChen Gang <gang.chen.5i5j@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8f235d1a

mm/vmscan.c: change trace_mm_vmscan_writepage() proto type · 3aa23851

由 yalin wang 提交于 1月 14, 2016

Move trace_reclaim_flags() into trace function, so that we don't need
caculate these flags if the trace is disabled.
Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3aa23851

kmemcg: account certain kmem allocations to memcg · 5d097056

由 Vladimir Davydov 提交于 1月 14, 2016

Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5d097056

slab: add SLAB_ACCOUNT flag · 230e9fc2

由 Vladimir Davydov 提交于 1月 14, 2016

Currently, if we want to account all objects of a particular kmem cache,
we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is
inconvenient.  This patch introduces SLAB_ACCOUNT flag which if passed
to kmem_cache_create will force accounting for every allocation from
this cache even if __GFP_ACCOUNT is not passed.

This patch does not make any of the existing caches use this flag - it
will be done later in the series.

Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o
SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and
hence cannot have different sets of SLAB_* flags.  Thus using this flag
will probably reduce the number of merged slabs even if kmem accounting
is not used (only compiled in).
Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Suggested-by: NTejun Heo <tj@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

230e9fc2

memcg: only account kmem allocations marked as __GFP_ACCOUNT · a9bb7e62

由 Vladimir Davydov 提交于 1月 14, 2016

Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.

So this patch switches kmem accounting to the white-policy: now only
those kmem allocations that are marked as __GFP_ACCOUNT are accounted to
memcg.  Currently, no kmem allocations are marked like this.  The
following patches will mark several kmem allocations that are known to
be easily triggered from userspace and therefore should be accounted to
memcg.
Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a9bb7e62

Revert "gfp: add __GFP_NOACCOUNT" · 20b5c303

由 Vladimir Davydov 提交于 1月 14, 2016

This reverts commit 8f4fc071 ("gfp: add __GFP_NOACCOUNT").

Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be
fragile and difficult to maintain, because there seem to be many more
allocations that should not be accounted than those that should be.
Besides, false accounting an allocation might result in much worse
consequences than not accounting at all, namely increased memory
consumption due to pinned dead kmem caches.

So it was decided to switch to the white-list policy.  This patch
reverts bits introducing the black-list policy.  The white-list policy
will be introduced later in the series.
Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

20b5c303

include/linux/dcache.h: remove semicolons from HASH_LEN_DECLARE · 2bd03e49

由 Andrew Morton 提交于 1月 14, 2016

A little cleanup - the invocation site provdes the semicolon.

Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2bd03e49

fsnotify: destroy marks with call_srcu instead of dedicated thread · c510eff6

由 Jeff Layton 提交于 1月 14, 2016

At the time that this code was originally written, call_srcu didn't
exist, so this thread was required to ensure that we waited for that
SRCU grace period to settle before finally freeing the object.

It does exist now however and we can much more efficiently use call_srcu
to handle this.  That also allows us to potentially use srcu_barrier to
ensure that they are all of the callbacks have run before proceeding.
In order to conserve space, we union the rcu_head with the g_list.

This will be necessary for nfsd which will allocate marks from a
dedicated slabcache.  We have to be able to ensure that all of the
objects are destroyed before destroying the cache.  That's fairly
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Cc: Eric Paris <eparis@parisplace.org>
Reviewed-by: NJan Kara <jack@suse.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c510eff6

12 1月, 2016 9 次提交

net: Fix typo in netdev_intersect_features · b6a0e72a

由 Tom Herbert 提交于 1月 11, 2016

Obviously need to 'or in NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM.

Fixes: c8cd0989 ("net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM")
Reported-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b6a0e72a

IB/mlx5: Add flow steering support · 038d2ef8

由 Maor Gottlieb 提交于 1月 11, 2016

Adding flow steering support by creating a flow-table per
priority (if rules exist in the priority). mlx5_ib uses
autogrouping and thus only creates the required destinations.

Also includes adding of these flow steering utilities

1. Parsing verbs flow attributes hardware steering specs.

2. Check if flow is multicast - this is required in order to decide
to which flow table will we add the steering rule.

3. Set outer headers in flow match criteria to zeros.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NMoni Shoua <monis@mellanox.com>
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

038d2ef8

net/mlx5_core: Make ipv4/ipv6 location more clear · b4d1f032

由 Maor Gottlieb 提交于 1月 11, 2016

Change the mlx5 firmware interface header to make it
more clear which bytes should be used by IPv4 or
IPv6 addresses.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NMoni Shoua <monis@mellanox.com>
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b4d1f032

net/mlx5_core: Enable flow steering support for the IB driver · 4cbdd30e

由 Maor Gottlieb 提交于 1月 11, 2016

When the driver is loaded, we create flow steering namespace
for kernel bypass with nine priorities and another namespace
for leftovers(in order to catch packets that weren't matched).
Verbs applications will use these priorities.
we found nine as a number that balances the requirements from the
user and retains performance.

The bypass namespace is used by verbs applications that want to bypass
the kernel networking stack. The leftovers namespace is used by verbs
applications and the sniffer in order to catch packets that weren't
handled by any preceding rules.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NMoni Shoua <monis@mellanox.com>
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4cbdd30e

net/mlx5_core: Introduce modify flow table command · 34a40e68

由 Maor Gottlieb 提交于 1月 11, 2016

Introduce the modify flow table command. This command is used when
we want to change the next flow table of an existing flow table.
The next flow table is defined as the table we search (in order
to find a match), if we couldn't find a match in any of the flow table
entries in the current flow table.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NMoni Shoua <monis@mellanox.com>
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

34a40e68

net/mlx5_core: Managing root flow table · 2cc43b49

由 Maor Gottlieb 提交于 1月 11, 2016

The root Flow Table for each Flow Table Type is defined,
by default, as the Flow Table with level 0.

In order not to use an empty flow tables and introduce new hops,
but still preserve space for flow-tables that have a priority
greater(lower number) than the current flow table, we introduce this
new set root flow table command.
This command tells the HW to start matching packets from the
assigned root flow table.
This command is used when we create new flow table with level lower than the
current lowest flow table or it is the first flow table.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NMoni Shoua <monis@mellanox.com>
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2cc43b49

net/mlx5_core: Introduce flow steering autogrouped flow table · f0d22d18

由 Maor Gottlieb 提交于 1月 11, 2016

When user add rule to autogrouped flow table, we search
for flow group with the same match criteria, if we don't
find such group then we create new flow group with the
required match criteria and insert the rule to this group.

We divide the flow table into required_groups + 1,
in order to reserve a part of the flow table for rules
which don't match any existing group.
Signed-off-by: NMaor Gottlieb <maorg@mellanox.com>
Signed-off-by: NMoni Shoua <monis@mellanox.com>
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f0d22d18

bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key · c6c33454

由 Daniel Borkmann 提交于 1月 11, 2016

After IPv6 support has recently been added to metadata dst and related
encaps, add support for populating/reading it from an eBPF program.

Commit d3aa45ce ("bpf: add helpers to access tunnel metadata") started
with initial IPv4-only support back then (due to IPv6 metadata support
not being available yet).

To stay compatible with older programs, we need to test for the passed
structure size. Also TOS and TTL support from the ip_tunnel_info key has
been added. Tested with vxlan devs in collect meta data mode with IPv4,
IPv6 and in compat mode over different network namespaces.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c6c33454

bpf: export helper function flags and reject invalid ones · 781c53bc

由 Daniel Borkmann 提交于 1月 11, 2016

Export flags used by eBPF helper functions through UAPI, so they can be
used by programs (instead of them redefining all flags each time or just
using the hard-coded values). It also gives a better overview what flags
are used where and we can further get rid of the extra macros defined in
filter.c. Moreover, reject invalid flags.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

781c53bc

11 1月, 2016 7 次提交

[media] Postpone the addition of MEDIA_IOC_G_TOPOLOGY · be0270ec

由 Mauro Carvalho Chehab 提交于 12月 28, 2015

There are a few discussions left with regards to this ioctl:

1) the name of the new structs will contain _v2_ on it?
2) what's the best alternative to avoid compat32 issues?

Due to that, let's postpone the addition of this new ioctl to
the next Kernel version, to give people more time to discuss it.
Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>

be0270ec

[media] media-entitiy: add a function to create multiple links · b01cc9ce

由 Mauro Carvalho Chehab 提交于 12月 30, 2015

Sometimes, it is desired to create 1:n and n:1 or even
n:n links between different entities with the same
function.

This is actually needed to support DVB devices that
have multiple frontends. While we could do a function
like that internally at the DVB core, such function is
generic enough to be at media-entity, and it could be
useful on some other places.

So, add such function.
Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>

b01cc9ce

[media] uapi/media.h: Use u32 for the number of graph objects · 7c9d6731

由 Mauro Carvalho Chehab 提交于 12月 17, 2015

While we need to keep a u64 alignment to avoid compat32 issues,
having the number of entities/pads/links/interfaces represented
by an u64 is incoherent with the ID number, with is an u32.

In order to make it coherent, change those quantities to u32.
Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>

7c9d6731

[media] media-entity.h: document the remaining functions · 630c0e80

由 Mauro Carvalho Chehab 提交于 12月 16, 2015

There are two ancillary functions that are missing comments.

While those are used only internally at media-entity.c,
document them, for completeness.
Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>

630c0e80

[media] media-device.h: use just one u32 counter for object ID · 05b3b77c

由 Mauro Carvalho Chehab 提交于 12月 16, 2015

Instead of using one u32 counter per type for object IDs, use
just one counter. With such change, it makes sense to simplify
the debug logs too.
Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>

05b3b77c

[media] media-entity.h fix documentation for several parameters · 03e49338

由 Mauro Carvalho Chehab 提交于 12月 16, 2015

Several parameters added by the media_ent_enum patches
were declared with wrong argument names:
include/media/media-device.h:333: warning: No description found for parameter 'entity_internal_idx_max'
include/media/media-device.h:354: warning: No description found for parameter 'ent_enum'
include/media/media-device.h:354: warning: Excess function parameter 'e' description in 'media_entity_enum_init'
include/media/media-device.h:333: warning: No description found for parameter 'entity_internal_idx_max'
include/media/media-device.h:354: warning: No description found for parameter 'ent_enum'
include/media/media-device.h:354: warning: Excess function parameter 'e' description in 'media_entity_enum_init'
include/media/media-entity.h:397: warning: No description found for parameter 'ent_enum'
include/media/media-entity.h:397: warning: Excess function parameter 'e' description in 'media_entity_enum_zero'
include/media/media-entity.h:409: warning: No description found for parameter 'ent_enum'
include/media/media-entity.h:409: warning: Excess function parameter 'e' description in 'media_entity_enum_set'
include/media/media-entity.h:424: warning: No description found for parameter 'ent_enum'
include/media/media-entity.h:424: warning: Excess function parameter 'e' description in 'media_entity_enum_clear'
include/media/media-entity.h:441: warning: No description found for parameter 'ent_enum'
include/media/media-entity.h:441: warning: Excess function parameter 'e' description in 'media_entity_enum_test'
include/media/media-entity.h:458: warning: No description found for parameter 'ent_enum'
include/media/media-entity.h:458: warning: Excess function parameter 'e' description in 'media_entity_enum_test_and_set'
include/media/media-entity.h:474: warning: No description found for parameter 'ent_enum'
include/media/media-entity.h:474: warning: Excess function parameter 'e' description in 'media_entity_enum_empty'
include/media/media-entity.h:474: warning: Excess function parameter 'entity' description in 'media_entity_enum_empty'
include/media/media-entity.h:489: warning: No description found for parameter 'ent_enum1'
include/media/media-entity.h:489: warning: No description found for parameter 'ent_enum2'
include/media/media-entity.h:489: warning: Excess function parameter 'e' description in 'media_entity_enum_intersects'
include/media/media-entity.h:489: warning: Excess function parameter 'f' description in 'media_entity_enum_intersects'

Fix them.
Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>

03e49338

[media] DocBook: document media_entity_graph_walk_cleanup() · aa360d3d

由 Mauro Carvalho Chehab 提交于 12月 16, 2015

This function was added recently, but weren't documented.
Add documentation for it.
Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>

aa360d3d