提交 · a6ffdc07847e74cc244c02ab6d0351a4a5d77281 · openanolis / cloud-kernel

04 5月, 2017 7 次提交

mm: use is_migrate_highatomic() to simplify the code · a6ffdc07

由 Xishi Qiu 提交于 5月 03, 2017

Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().

Simplify the code, no functional changes.

[akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.comSigned-off-by: NXishi Qiu <qiuxishi@huawei.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a6ffdc07

mm: memcontrol: provide shmem statistics · 9a4caf1e

由 Johannes Weiner 提交于 5月 03, 2017

Cgroups currently don't report how much shmem they use, which can be
useful data to have, in particular since shmem is included in the
cache/file item while being reclaimed like anonymous memory.

Add a counter to track shmem pages during charging and uncharging.

Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Reported-by: NChris Down <cdown@fb.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9a4caf1e

mm: reclaim MADV_FREE pages · 802a3a92

由 Shaohua Li 提交于 5月 03, 2017

When memory pressure is high, we free MADV_FREE pages.  If the pages are
not dirty in pte, the pages could be freed immediately.  Otherwise we
can't reclaim them.  We put the pages back to anonumous LRU list (by
setting SwapBacked flag) and the pages will be reclaimed in normal
swapout way.

We use normal page reclaim policy.  Since MADV_FREE pages are put into
inactive file list, such pages and inactive file pages are reclaimed
according to their age.  This is expected, because we don't want to
reclaim too many MADV_FREE pages before used once pages.

Based on Minchan's original patch

[minchan@kernel.org: clean up lazyfree page handling]
  Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.comSigned-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

802a3a92

mm: move MADV_FREE pages into LRU_INACTIVE_FILE list · f7ad2a6c

由 Shaohua Li 提交于 5月 03, 2017

madv()'s MADV_FREE indicate pages are 'lazyfree'.  They are still
anonymous pages, but they can be freed without pageout.  To distinguish
these from normal anonymous pages, we clear their SwapBacked flag.

MADV_FREE pages could be freed without pageout, so they pretty much like
used once file pages.  For such pages, we'd like to reclaim them once
there is memory pressure.  Also it might be unfair reclaiming MADV_FREE
pages always before used once file pages and we definitively want to
reclaim the pages before other anonymous and file pages.

To speed up MADV_FREE pages reclaim, we put the pages into
LRU_INACTIVE_FILE list.  The rationale is LRU_INACTIVE_FILE list is tiny
nowadays and should be full of used once file pages.  Reclaiming
MADV_FREE pages will not have much interfere of anonymous and active
file pages.  And the inactive file pages and MADV_FREE pages will be
reclaimed according to their age, so we don't reclaim too many MADV_FREE
pages too.  Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
means we can reclaim the pages without swap support.  This idea is
suggested by Johannes.

This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
avoid bisect failure, next patch will do it.

The patch is based on Minchan's original patch.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.comSigned-off-by: NShaohua Li <shli@fb.com>
Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f7ad2a6c

mm: delete unnecessary TTU_* flags · a128ca71

由 Shaohua Li 提交于 5月 03, 2017

Patch series "mm: fix some MADV_FREE issues", v5.

We are trying to use MADV_FREE in jemalloc.  Several issues are found.
Without solving the issues, jemalloc can't use the MADV_FREE feature.

 - Doesn't support system without swap enabled. Because if swap is off,
   we can't or can't efficiently age anonymous pages. And since
   MADV_FREE pages are mixed with other anonymous pages, we can't
   reclaim MADV_FREE pages. In current implementation, MADV_FREE will
   fallback to MADV_DONTNEED without swap enabled. But in our
   environment, a lot of machines don't enable swap. This will prevent
   our setup using MADV_FREE.

 - Increases memory pressure. page reclaim bias file pages reclaim
   against anonymous pages. This doesn't make sense for MADV_FREE pages,
   because those pages could be freed easily and refilled with very
   slight penality. Even page reclaim doesn't bias file pages, there is
   still an issue, because MADV_FREE pages and other anonymous pages are
   mixed together. To reclaim a MADV_FREE page, we probably must scan a
   lot of other anonymous pages, which is inefficient. In our test, we
   usually see oom with MADV_FREE enabled and nothing without it.

 - Accounting. There are two accounting problems. We don't have a global
   accounting. If the system is abnormal, we don't know if it's a
   problem from MADV_FREE side. The other problem is RSS accounting.
   MADV_FREE pages are accounted as normal anon pages and reclaimed
   lazily, so application's RSS becomes bigger. This confuses our
   workloads. We have monitoring daemon running and if it finds
   applications' RSS becomes abnormal, the daemon will kill the
   applications even kernel can reclaim the memory easily.

To address the first the two issues, we can either put MADV_FREE pages
into a separate LRU list (Minchan's previous patches and V1 patches), or
put them into LRU_INACTIVE_FILE list (suggested by Johannes).  The
patchset use the second idea.  The reason is LRU_INACTIVE_FILE list is
tiny nowadays and should be full of used once file pages.  So we can
still efficiently reclaim MADV_FREE pages there without interference
with other anon and active file pages.  Putting the pages into inactive
file list also has an advantage which allows page reclaim to prioritize
MADV_FREE pages and used once file pages.  MADV_FREE pages are put into
the lru list and clear SwapBacked flag, so PageAnon(page) &&
!PageSwapBacked(page) will indicate a MADV_FREE pages.  These pages will
directly freed without pageout if they are clean, otherwise normal swap
will reclaim them.

For the third issue, the previous post adds global accounting and a
separate RSS count for MADV_FREE pages.  The problem is we never get
accurate accounting for MADV_FREE pages.  The pages are mapped to
userspace, can be dirtied without notice from kernel side.  To get
accurate accounting, we could write protect the page, but then there is
extra page fault overhead, which people don't want to pay.  Jemalloc
guys have concerns about the inaccurate accounting, so this post drops
the accounting patches temporarily.  The info exported to
/proc/pid/smaps for MADV_FREE pages are kept, which is the only place we
can get accurate accounting right now.

This patch (of 6):

Johannes pointed out TTU_LZFREE is unnecessary.  It's true because we
always have the flag set if we want to do an unmap.  For cases we don't
do an unmap, the TTU_LZFREE part of code should never run.

Also the TTU_UNMAP is unnecessary.  If no other flags set (for example,
TTU_MIGRATION), an unmap is implied.

The patch includes Johannes's cleanup and dead TTU_ACTION macro removal
code

Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.comSigned-off-by: NShaohua Li <shli@fb.com>
Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMinchan Kim <minchan@kernel.org>
Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a128ca71

mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622

由 Johannes Weiner 提交于 5月 03, 2017

NR_PAGES_SCANNED counts number of pages scanned since the last page free
event in the allocator.  This was used primarily to measure the
reclaimability of zones and nodes, and determine when reclaim should
give up on them.  In that role, it has been replaced in the preceding
patches by a different mechanism.

Being implemented as an efficient vmstat counter, it was automatically
exported to userspace as well.  It's however unlikely that anyone
outside the kernel is using this counter in any meaningful way.

Remove the counter and the unused pgdat_reclaimable().

Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c822f622

mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0

由 Johannes Weiner 提交于 5月 03, 2017

Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".

Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage. We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages. In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met. Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.

This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer. If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.

If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.

This patch (of 9):

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism. It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them. This guard was erroneously
removed as the patch changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.

[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Signed-off-by: NShakeel Butt <shakeelb@google.com>
Reported-by: NJia He <hejianet@gmail.com>
Tested-by: NJia He <hejianet@gmail.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: NMinchan Kim <minchan@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c73322d0

03 5月, 2017 1 次提交

qed*: Fix issues in the ptp filter config implementation. · 8d3f87d8

由 sudarsana.kalluru@cavium.com 提交于 5月 02, 2017

PTP hardware filter configuration performed by the driver for a given
user requested config is not correct for some of the PTP modes.
Following changes are needed for PTP config-filter implementation.
 1. NIG_REG_TX_PTP_EN register - Bits 0/1/2 respectively enables
    TimeSync/"V1 frame format support"/"V2 frame format support" on
    the TX side. Set the associated bits based on the user request.
 2. ptp4l application fails to operate in Peer Delay mode. Following
    changes are needed to fix this,
    a. Driver should enable (set to 0) DA #1-related bits for IPv4,
       IPv6 and MAC destination addresses in these registers:
         NIG_REG_TX_LLH_PTP_RULE_MASK
         NIG_REG_LLH_PTP_RULE_MASK
    b. NIG_REG_LLH_PTP_PARAM_MASK/NIG_REG_TX_LLH_PTP_PARAM_MASK should
       be set to 0x0 in all modes.
Signed-off-by: NSudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
Signed-off-by: NYuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8d3f87d8

02 5月, 2017 3 次提交

audit: Use timespec64 to represent audit timestamps · 2115bb25

由 Deepa Dinamani 提交于 5月 02, 2017

struct timespec is not y2038 safe.
Audit timestamps are recorded in string format into
an audit buffer for a given context.
These mark the entry timestamps for the syscalls.
Use y2038 safe struct timespec64 to represent the times.
The log strings can handle this transition as strings can
hold upto 1024 characters.
Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NPaul Moore <paul@paul-moore.com>
Acked-by: NRichard Guy Briggs <rgb@redhat.com>
Signed-off-by: NPaul Moore <paul@paul-moore.com>

2115bb25

audit: kernel generated netlink traffic should have a portid of 0 · 45a0642b

由 Paul Moore 提交于 5月 02, 2017

We were setting the portid incorrectly in the netlink message headers,
fix that to always be 0 (nlmsg_pid = 0).
Signed-off-by: NPaul Moore <paul@paul-moore.com>
Reviewed-by: NRichard Guy Briggs <rgb@redhat.com>

45a0642b

rhashtable: compact struct rhashtable_params · 48e75b43

由 Florian Westphal 提交于 5月 01, 2017

By using smaller datatypes this (rather large) struct shrinks considerably
(80 -> 48 bytes on x86_64).

As this is embedded in other structs, this also rerduces size of several
others, e.g. cls_fl_head or nft_hash.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

48e75b43

01 5月, 2017 4 次提交

qed: output the DPM status and WID count · 20b1bd96

由 Ram Amrani 提交于 4月 30, 2017

Output to the RDMA driver whether DPM mode is enabled or disabled in
the HW and if so what is the number of WIDs it supports
Signed-off-by: NRam Amrani <Ram.Amrani@cavium.com>
Signed-off-by: NYuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

20b1bd96

xdp: propagate extended ack to XDP setup · ddf9f970

由 Jakub Kicinski 提交于 4月 30, 2017

Drivers usually have a number of restrictions for running XDP
- most common being buffer sizes, LRO and number of rings.
Even though some drivers try to be helpful and print error
messages experience shows that users don't often consult
kernel logs on netlink errors.  Try to use the new extended
ack mechanism to carry the message back to user space.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ddf9f970

netlink: add NULL-friendly helper for setting extended ACK message · 45d9b378

由 Jakub Kicinski 提交于 4月 30, 2017

As we propagate extended ack reporting throughout various paths in
the kernel it may be that the same function is called with the
extended ack parameter passed as NULL.  One place where that happens
is in drivers which have a centralized reconfiguration function
called both from ndos and from ethtool_ops.  Add a new helper for
setting the error message in such conditions.

Existing helper is left as is to encourage propagating the ext act
fully wherever possible.  It also makes it clear in the code which
messages may be lost due to ext ack being NULL.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

45d9b378

mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash · 71389703

由 Dan Williams 提交于 4月 28, 2017

The x86 conversion to the generic GUP code included a small change which causes
crashes and data corruption in the pmem code - not good.

The root cause is that the /dev/pmem driver code implicitly relies on the x86
get_user_pages() implementation doing a get_page() on the page refcount, because
get_page() does a get_zone_device_page() which properly refcounts pmem's separate
page struct arrays that are not present in the regular page struct structures.
(The pmem driver does this because it can cover huge memory areas.)

But the x86 conversion to the generic GUP code changed the get_page() to
page_cache_get_speculative() which is faster but doesn't do the
get_zone_device_page() call the pmem code relies on.

One way to solve the regression would be to change the generic GUP code to use
get_page(), but that would slow things down a bit and punish other generic-GUP
using architectures for an x86-ism they did not care about. (Arguably the pmem
driver was probably not working reliably for them: but nvdimm is an Intel
feature, so non-x86 exposure is probably still limited.)

So restructure the pmem code's interface with the MM instead: get rid of the
get/put_zone_device_page() distinction, integrate put_zone_device_page() into
__put_page() and and restructure the pmem completion-wait and teardown machinery:

Kirill points out that the calls to {get,put}_dev_pagemap() can be
removed from the mm fast path if we take a single get_dev_pagemap()
reference to signify that the page is alive and use the final put of the
page to drop that reference.

This does require some care to make sure that any waits for the
percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
since it now maintains its own elevated reference.

This speeds up things while also making the pmem refcounting more robust going
forward.
Suggested-by: NKirill Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: NKirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

71389703

30 4月, 2017 1 次提交

net/mlx5e: Update neighbour 'used' state using HW flow rules counters · f6dfb4c3

由 Hadar Hen Zion 提交于 2月 24, 2017

When IP tunnel encapsulation rules are offloaded, the kernel can't see
the traffic of the offloaded flow. The neighbour for the IP tunnel
destination of the offloaded flow can mistakenly become STALE and
deleted by the kernel since its 'used' value wasn't changed.

To make sure that a neighbour which is used by the HW won't become
STALE, we proactively update the neighbour 'used' value every
DELAY_PROBE_TIME period, when packets were matched and counted by the HW
for one of the tunnel encap flows related to this neighbour.

The periodic task that updates the used neighbours is scheduled when a
tunnel encap rule is successfully offloaded into HW and keeps re-scheduling
itself as long as the representor's neighbours list isn't empty.

Add, remove, lookup and status change operations done over the
representor's neighbours list or the neighbour hash entry encaps list
are all serialized by RTNL lock.
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>

f6dfb4c3

29 4月, 2017 1 次提交

libata: remove SCT WRITE SAME support · 63ccc191

由 Christoph Hellwig 提交于 4月 25, 2017

This was already disabled a while ago because it caused I/O errors,
and it's severly getting into the way of the discard / write zeroes
rework.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

63ccc191

28 4月, 2017 6 次提交

blk-mq: unify hctx delay_work and run_work · 21c6e939

由 Jens Axboe 提交于 4月 10, 2017

The only difference between ->run_work and ->delay_work, is that
the latter is used to defer running a queue. This is done by
marking the queue stopped, and scheduling ->delay_work to run
sometime in the future. While the queue is stopped, direct runs
or runs through ->run_work will not run the queue.

If we combine the handlers, then we need to handle two things:

1) If a delayed/stopped run is scheduled, then we should not run
   the queue before that has been completed.
2) If a queue is delayed/stopped, the handler needs to restart
   the queue. Normally a run of a queue with the stopped bit set
   would be a no-op.

Case 1 is handled by modifying a currently pending queue run
to the deadline set by the caller of blk_mq_delay_queue().
Subsequent attempts to queue a queue run will find the work
item already pending, and direct runs will see a stopped queue
as before.

Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
that tells the work handler that it should clear a stopped
queue and run the handler.
Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

21c6e939

block: add kblock_mod_delayed_work_on() · 818cd1cb

由 Jens Axboe 提交于 4月 10, 2017

This modifies (or adds, if not currently pending) an existing
delayed work item.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

818cd1cb

blk-mq: unify hctx delayed_run_work and run_work · 9f993737

由 Jens Axboe 提交于 4月 10, 2017

They serve the exact same purpose. Get rid of the non-delayed
work variant, and just run it without delay for the normal case.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

9f993737

mac80211: Add support for BSS max idle period element · e38a017b

由 Avraham Stern 提交于 4月 26, 2017

Parse the BSS max idle period element and set the BSS configuration
accordingly so the driver can use this information to configure the
max idle period and to use protected management frames for keep alive
when required.

The BSS max idle period element is defined in IEEE802.11-2016,
section 9.4.2.79
Signed-off-by: NAvraham Stern <avraham.stern@intel.com>
Signed-off-by: NLuca Coelho <luciano.coelho@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

e38a017b

bridge: add per-port broadcast flood flag · 99f906e9

由 Mike Manning 提交于 4月 26, 2017

Support for l2 multicast flood control was added in commit b6cb5ac8
("net: bridge: add per-port multicast flood flag"). It allows broadcast
as it was introduced specifically for unknown multicast flood control.
But as broadcast is a special case of multicast, this may also need to
be disabled. For this purpose, introduce a flag to disable the flooding
of received l2 broadcasts. This approach is backwards compatible and
provides flexibility in filtering for the desired packet types.

Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NMike Manning <mmanning@brocade.com>
Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

99f906e9

net: update comment for netif_dormant() function · 8ecbc40a

由 Zhang Shengju 提交于 4月 26, 2017

This patch updates the comment for netif_dormant() function to reflect
the intended usage.
Signed-off-by: NZhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8ecbc40a

27 4月, 2017 10 次提交

rhashtable: Cap total number of entries to 2^31 · 6d684e54

由 Herbert Xu 提交于 4月 27, 2017

When max_size is not set or if it set to a sufficiently large
value, the nelems counter can overflow.  This would cause havoc
with the automatic shrinking as it would then attempt to fit a
huge number of entries into a tiny hash table.

This patch fixes this by adding max_elems to struct rhashtable
to cap the number of elements.  This is set to 2^31 as nelems is
not a precise count.  This is sufficiently smaller than UINT_MAX
that it should be safe.

When max_size is set max_elems will be lowered to at most twice
max_size as is the status quo.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6d684e54

ieee80211: fix kernel-doc parsing errors · f6601e17

由 Johannes Berg 提交于 4月 26, 2017

Some of the enum definitions are unnamed but there's still
an attempt at documenting them - that doesn't work. Name
them to make that work.
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

f6601e17

ieee80211: add FT-802.1X AKM suite selector · 2ead3235

由 Luca Coelho 提交于 4月 26, 2017

Add the definition for FT-8021.1X AKM selector as defined in
IEEE Std 802.11-2016, table 9-133.
Signed-off-by: NLuca Coelho <luciano.coelho@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

2ead3235

ieee80211: add SUITE_B AKM selectors · 1cbf41db

由 Luca Coelho 提交于 4月 26, 2017

Add the definitions for SUITE_B and SUITE_B_192 AKM selectors as
defined in IEEE802.11REVmc_D5.0, table 9-132.
Signed-off-by: NLuca Coelho <luciano.coelho@intel.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

1cbf41db

blk-mq: Add blk_mq_ops.show_rq() · 2836ee4b

由 Bart Van Assche 提交于 4月 26, 2017

This new callback function will be used in the next patch to show
more information about SCSI requests.
Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

2836ee4b

tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps · 645f4c6f

由 Eric Dumazet 提交于 4月 25, 2017

Some devices or distributions use HZ=100 or HZ=250

TCP receive buffer autotuning has poor behavior caused by this choice.
Since autotuning happens after 4 ms or 10 ms, short distance flows
get their receive buffer tuned to a very high value, but after an initial
period where it was frozen to (too small) initial value.

With tp->tcp_mstamp introduction, we can switch to high resolution
timestamps almost for free (at the expense of 8 additional bytes per
TCP structure)

Note that some TCP stacks use usec TCP timestamps where this
patch makes even more sense : Many TCP flows have < 500 usec RTT.
Hopefully this finer TS option can be standardized soon.

Tested:
 HZ=100 kernel
 ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &

 Peer without patch :
 lpaa24:~# ss -tmi dst lpaa23
 ...
 skmem:(r0,rb8388608,...)
 rcv_rtt:10 rcv_space:3210000 minrtt:0.017

 Peer with the patch :
 lpaa23:~# ss -tmi dst lpaa24
 ...
 skmem:(r0,rb428800,...)
 rcv_rtt:0.069 rcv_space:30000 minrtt:0.017

We can see saner RCVBUF, and more precise rcv_rtt information.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

645f4c6f

tcp: add tp->tcp_mstamp field · 69e996c5

由 Eric Dumazet 提交于 4月 25, 2017

We want to use precise timestamps in TCP stack, but we do not
want to call possibly expensive kernel time services too often.

tp->tcp_mstamp is guaranteed to be updated once per incoming packet.

We will use it in the following patches, removing specific
skb_mstamp_get() calls, and removing ack_time from
struct tcp_sacktag_state.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

69e996c5

rhashtable: remove insecure_max_entries param · 038a3e85

由 Florian Westphal 提交于 4月 25, 2017

no users in the tree, insecure_max_entries is always set to
ht->p.max_size * 2 in rhtashtable_init().

Replace only spot that uses it with a ht->p.max_size check.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

038a3e85

net: phy: fix auto-negotiation stall due to unavailable interrupt · f555f34f

由 Alexander Kochetkov 提交于 4月 20, 2017

The Ethernet link on an interrupt driven PHY was not coming up if the Ethernet
cable was plugged before the Ethernet interface was brought up.

The patch trigger PHY state machine to update link state if PHY was requested to
do auto-negotiation and auto-negotiation complete flag already set.

During power-up cycle the PHY do auto-negotiation, generate interrupt and set
auto-negotiation complete flag. Interrupt is handled by PHY state machine but
doesn't update link state because PHY is in PHY_READY state. After some time
MAC bring up, start and request PHY to do auto-negotiation. If there are no new
settings to advertise genphy_config_aneg() doesn't start PHY auto-negotiation.
PHY continue to stay in auto-negotiation complete state and doesn't fire
interrupt. At the same time PHY state machine expect that PHY started
auto-negotiation and is waiting for interrupt from PHY and it won't get it.

Fixes: 321beec5 ("net: phy: Use interrupts when available in NOLINK state")
Signed-off-by: NAlexander Kochetkov <al.kochet@gmail.com>
Cc: stable <stable@vger.kernel.org> # v4.9+
Tested-by: NRoger Quadros <rogerq@ti.com>
Tested-by: NAlexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f555f34f

A
CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now · 701cac61
由 Al Viro 提交于 4月 05, 2017
```
all architectures converted
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
701cac61

26 4月, 2017 3 次提交

ebtables: remove nf_hook_register usage · aee12a0a

由 Florian Westphal 提交于 4月 20, 2017

Similar to ip_register_table, pass nf_hook_ops to ebt_register_table().
This allows to handle hook registration also via pernet_ops and allows
us to avoid use of legacy register_hook api.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

aee12a0a

net: move xdp_prog field in RX cache lines · 7acedaf5

由 Eric Dumazet 提交于 4月 25, 2017

(struct net_device, xdp_prog) field should be moved in RX cache lines,
reducing latencies when a single packet is received on idle host,
since netif_elide_gro() needs it.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7acedaf5

net: Generic XDP · b5cdae32

由 David S. Miller 提交于 4月 18, 2017

This provides a generic SKB based non-optimized XDP path which is used
if either the driver lacks a specific XDP implementation, or the user
requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.

It is arguable that perhaps I should have required something like
this as part of the initial XDP feature merge.

I believe this is critical for two reasons:

1) Accessibility.  More people can play with XDP with less
   dependencies.  Yes I know we have XDP support in virtio_net, but
   that just creates another depedency for learning how to use this
   facility.

   I wrote this to make life easier for the XDP newbies.

2) As a model for what the expected semantics are.  If there is a pure
   generic core implementation, it serves as a semantic example for
   driver folks adding XDP support.

One thing I have not tried to address here is the issue of
XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
whatever even if the XDP program doesn't try to push headers at all.
I think we really need the verifier to somehow propagate whether
certain XDP helpers are used or not.

v5:
 - Handle both negative and positive offset after running prog
 - Fix mac length in XDP_TX case (Alexei)
 - Use rcu_dereference_protected() in free_netdev (kbuild test robot)

v4:
 - Fix MAC header adjustmnet before calling prog (David Ahern)
 - Disable LRO when generic XDP is installed (Michael Chan)
 - Bypass qdisc et al. on XDP_TX and record the event (Alexei)
 - Do not perform generic XDP on reinjected packets (DaveM)

v3:
 - Make sure XDP program sees packet at MAC header, push back MAC
   header if we do XDP_TX.  (Alexei)
 - Elide GRO when generic XDP is in use.  (Alexei)
 - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
   XDP even if the driver has an XDP implementation.  (Alexei)
 - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
   attribute.  (Daniel)

v2:
 - Add some "fall through" comments in switch statements based
   upon feedback from Andrew Lunn
 - Use RCU for generic xdp_prog, thanks to Johannes Berg.
Tested-by: NAndy Gospodarek <andy@greyhouse.net>
Tested-by: NJesper Dangaard Brouer <brouer@redhat.com>
Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b5cdae32

25 4月, 2017 4 次提交

qed/qede: Add UDP ports in bulletin board · 97379f15

由 Chopra, Manish 提交于 4月 24, 2017

This patch adds support for UDP ports in bulletin board
to notify UDP ports change to the VFs
Signed-off-by: NManish Chopra <manish.chopra@cavium.com>
Signed-off-by: NYuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97379f15

qed/qede: Enable tunnel offloads based on hw configuration · 19489c7f

由 Chopra, Manish 提交于 4月 24, 2017

This patch enables tunnel feature offloads based on hw configuration
at initialization time instead of enabling them always.
Signed-off-by: NManish Chopra <manish.chopra@cavium.com>
Signed-off-by: NYuval Mintz <yuval.mintz@cavium.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

19489c7f

can: complete initial namespace support · cb5635a3

由 Oliver Hartkopp 提交于 4月 25, 2017

The statistics and its proc output was not implemented as per-net in the
initial network namespace support by Mario Kicherer (8e8cda6d).
This patch adds the missing per-net statistics for the CAN subsystem.
Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

cb5635a3

can: peak: add support for PEAK PCAN-PCIe FD CAN-FD boards · 8ac8321e

由 Stephane Grosjean 提交于 1月 19, 2017

This patch adds the support of the PCAN-PCI Express FD boards made
by PEAK-System, for computers using the PCI Express slot.

The PCAN-PCI Express FD has one or two CAN FD channels, depending
on the model. A galvanic isolation of the CAN ports protects
the electronics of the card and the respective computer against
disturbances of up to 500 Volts. The PCAN-PCI Express FD can be operated
with ambient temperatures in a range of -40 to +85 °C.

Such boards run an extented version of the CAN-FD IP running into USB
CAN-FD interfaces from PEAK-System, so this patch adds several new commands
and their corresponding data types to the PEAK CAN-FD common definitions
header file too.
Signed-off-by: NStephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: NMarc Kleine-Budde <mkl@pengutronix.de>

8ac8321e

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功