提交 · dcda9b04713c3f6ff0875652924844fae28286ea · openanolis / cloud-kernel

13 7月, 2017 1 次提交

mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04

由 Michal Hocko 提交于 7月 12, 2017

__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator.  This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
ignored for smaller sizes.  This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success.  This will work independent of the order and overrides the
default allocator behavior.  Page allocator users have several levels of
guarantee vs.  cost options (take GFP_KERNEL as an example)

 - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
   attempt to free memory at all. The most light weight mode which even
   doesn't kick the background reclaim. Should be used carefully because
   it might deplete the memory and the next user might hit the more
   aggressive reclaim

 - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
   allocation without any attempt to free memory from the current
   context but can wake kswapd to reclaim memory if the zone is below
   the low watermark. Can be used from either atomic contexts or when
   the request is a performance optimization and there is another
   fallback for a slow path.

 - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
   non sleeping allocation with an expensive fallback so it can access
   some portion of memory reserves. Usually used from interrupt/bh
   context with an expensive slow path fallback.

 - GFP_KERNEL - both background and direct reclaim are allowed and the
   _default_ page allocator behavior is used. That means that !costly
   allocation requests are basically nofail but there is no guarantee of
   that behavior so failures have to be checked properly by callers
   (e.g. OOM killer victim is allowed to fail currently).

 - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
   and all allocation requests fail early rather than cause disruptive
   reclaim (one round of reclaim in this implementation). The OOM killer
   is not invoked.

 - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
   behavior and all allocation requests try really hard. The request
   will fail if the reclaim cannot make any progress. The OOM killer
   won't be triggered.

 - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
   and all allocation requests will loop endlessly until they succeed.
   This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic.  No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
  Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
  Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dcda9b04

08 7月, 2017 1 次提交

bonding: avoid NETDEV_CHANGEMTU event when unregistering slave · f51048c3

由 WANG Cong 提交于 7月 06, 2017

As Hongjun/Nicolas summarized in their original patch:

"
When a device changes from one netns to another, it's first unregistered,
then the netns reference is updated and the dev is registered in the new
netns. Thus, when a slave moves to another netns, it is first
unregistered. This triggers a NETDEV_UNREGISTER event which is caught by
the bonding driver. The driver calls bond_release(), which calls
dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in
the old netns).
"

This is a very special case, because the device is being unregistered
no one should still care about the NETDEV_CHANGEMTU event triggered
at this point, we can avoid broadcasting this event on this path,
and avoid touching inetdev_event()/addrconf_notify() path.

It requires to export __dev_set_mtu() to bonding driver.
Reported-by: NHongjun Li <hongjun.li@6wind.com>
Reported-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f51048c3

03 7月, 2017 1 次提交

net: core: Fix slab-out-of-bounds in netdev_stats_to_stats64 · 9af9959e

由 Alban Browaeys 提交于 7月 03, 2017

commit 9256645a ("net/core: relax BUILD_BUG_ON in
netdev_stats_to_stats64") made an attempt to read beyond
the size of the source a possibility.

Fix to only copy src size to dest. As dest might be bigger than src.

 ==================================================================
 BUG: KASAN: slab-out-of-bounds in netdev_stats_to_stats64+0xe/0x30 at addr ffff8801be248b20
 Read of size 192 by task VBoxNetAdpCtl/6734
 CPU: 1 PID: 6734 Comm: VBoxNetAdpCtl Tainted: G           O    4.11.4prahal+intel+ #118
 Hardware name: LENOVO 20CDCTO1WW/20CDCTO1WW, BIOS GQET52WW (1.32 ) 05/04/2017
 Call Trace:
  dump_stack+0x63/0x86
  kasan_object_err+0x1c/0x70
  kasan_report+0x270/0x520
  ? netdev_stats_to_stats64+0xe/0x30
  ? sched_clock_cpu+0x1b/0x190
  ? __module_address+0x3e/0x3b0
  ? unwind_next_frame+0x1ea/0xb00
  check_memory_region+0x13c/0x1a0
  memcpy+0x23/0x50
  netdev_stats_to_stats64+0xe/0x30
  dev_get_stats+0x1b9/0x230
  rtnl_fill_stats+0x44/0xc00
  ? nla_put+0xc6/0x130
  rtnl_fill_ifinfo+0xe9e/0x3700
  ? rtnl_fill_vfinfo+0xde0/0xde0
  ? sched_clock+0x9/0x10
  ? sched_clock+0x9/0x10
  ? sched_clock_local+0x120/0x130
  ? __module_address+0x3e/0x3b0
  ? unwind_next_frame+0x1ea/0xb00
  ? sched_clock+0x9/0x10
  ? sched_clock+0x9/0x10
  ? sched_clock_cpu+0x1b/0x190
  ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
  ? depot_save_stack+0x1d8/0x4a0
  ? depot_save_stack+0x34f/0x4a0
  ? depot_save_stack+0x34f/0x4a0
  ? save_stack+0xb1/0xd0
  ? save_stack_trace+0x16/0x20
  ? save_stack+0x46/0xd0
  ? kasan_slab_alloc+0x12/0x20
  ? __kmalloc_node_track_caller+0x10d/0x350
  ? __kmalloc_reserve.isra.36+0x2c/0xc0
  ? __alloc_skb+0xd0/0x560
  ? rtmsg_ifinfo_build_skb+0x61/0x120
  ? rtmsg_ifinfo.part.25+0x16/0xb0
  ? rtmsg_ifinfo+0x47/0x70
  ? register_netdev+0x15/0x30
  ? vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
  ? vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
  ? VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
  ? do_vfs_ioctl+0x17f/0xff0
  ? SyS_ioctl+0x74/0x80
  ? do_syscall_64+0x182/0x390
  ? __alloc_skb+0xd0/0x560
  ? __alloc_skb+0xd0/0x560
  ? save_stack_trace+0x16/0x20
  ? init_object+0x64/0xa0
  ? ___slab_alloc+0x1ae/0x5c0
  ? ___slab_alloc+0x1ae/0x5c0
  ? __alloc_skb+0xd0/0x560
  ? sched_clock+0x9/0x10
  ? kasan_unpoison_shadow+0x35/0x50
  ? kasan_kmalloc+0xad/0xe0
  ? __kmalloc_node_track_caller+0x246/0x350
  ? __alloc_skb+0xd0/0x560
  ? kasan_unpoison_shadow+0x35/0x50
  ? memset+0x31/0x40
  ? __alloc_skb+0x31f/0x560
  ? napi_consume_skb+0x320/0x320
  ? br_get_link_af_size_filtered+0xb7/0x120 [bridge]
  ? if_nlmsg_size+0x440/0x630
  rtmsg_ifinfo_build_skb+0x83/0x120
  rtmsg_ifinfo.part.25+0x16/0xb0
  rtmsg_ifinfo+0x47/0x70
  register_netdevice+0xa2b/0xe50
  ? __kmalloc+0x171/0x2d0
  ? netdev_change_features+0x80/0x80
  register_netdev+0x15/0x30
  vboxNetAdpOsCreate+0xc0/0x1c0 [vboxnetadp]
  vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
  ? vboxNetAdpComposeMACAddress+0x1d0/0x1d0 [vboxnetadp]
  ? kasan_check_write+0x14/0x20
  VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
  ? VBoxNetAdpLinuxOpen+0x20/0x20 [vboxnetadp]
  ? lock_acquire+0x11c/0x270
  ? __audit_syscall_entry+0x2fb/0x660
  do_vfs_ioctl+0x17f/0xff0
  ? __audit_syscall_entry+0x2fb/0x660
  ? ioctl_preallocate+0x1d0/0x1d0
  ? __audit_syscall_entry+0x2fb/0x660
  ? kmem_cache_free+0xb2/0x250
  ? syscall_trace_enter+0x537/0xd00
  ? exit_to_usermode_loop+0x100/0x100
  SyS_ioctl+0x74/0x80
  ? do_sys_open+0x350/0x350
  ? do_vfs_ioctl+0xff0/0xff0
  do_syscall_64+0x182/0x390
  entry_SYSCALL64_slow_path+0x25/0x25
 RIP: 0033:0x7f7e39a1ae07
 RSP: 002b:00007ffc6f04c6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 00007ffc6f04c730 RCX: 00007f7e39a1ae07
 RDX: 00007ffc6f04c730 RSI: 00000000c0207601 RDI: 0000000000000007
 RBP: 00007ffc6f04c700 R08: 00007ffc6f04c780 R09: 0000000000000008
 R10: 0000000000000541 R11: 0000000000000206 R12: 0000000000000007
 R13: 00000000c0207601 R14: 00007ffc6f04c730 R15: 0000000000000012
 Object at ffff8801be248008, in cache kmalloc-4096 size: 4096
 Allocated:
 PID = 6734
  save_stack_trace+0x16/0x20
  save_stack+0x46/0xd0
  kasan_kmalloc+0xad/0xe0
  __kmalloc+0x171/0x2d0
  alloc_netdev_mqs+0x8a7/0xbe0
  vboxNetAdpOsCreate+0x65/0x1c0 [vboxnetadp]
  vboxNetAdpCreate+0x210/0x400 [vboxnetadp]
  VBoxNetAdpLinuxIOCtlUnlocked+0x14b/0x280 [vboxnetadp]
  do_vfs_ioctl+0x17f/0xff0
  SyS_ioctl+0x74/0x80
  do_syscall_64+0x182/0x390
  return_from_SYSCALL_64+0x0/0x6a
 Freed:
 PID = 5600
  save_stack_trace+0x16/0x20
  save_stack+0x46/0xd0
  kasan_slab_free+0x73/0xc0
  kfree+0xe4/0x220
  kvfree+0x25/0x30
  single_release+0x74/0xb0
  __fput+0x265/0x6b0
  ____fput+0x9/0x10
  task_work_run+0xd5/0x150
  exit_to_usermode_loop+0xe2/0x100
  do_syscall_64+0x26c/0x390
  return_from_SYSCALL_64+0x0/0x6a
 Memory state around the buggy address:
  ffff8801be248a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff8801be248b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff8801be248b80: 00 00 00 00 00 00 00 00 00 00 00 07 fc fc fc fc
                                                     ^
  ffff8801be248c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8801be248c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ==================================================================
Signed-off-by: NAlban Browaeys <alban.browaeys@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9af9959e

01 7月, 2017 1 次提交

net: convert sk_buff.users from atomic_t to refcount_t · 63354797

由 Reshetova, Elena 提交于 6月 30, 2017

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NDavid Windsor <dwindsor@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

63354797

30 6月, 2017 1 次提交

net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() · e44699d2

由 Michal Kubeček 提交于 6月 29, 2017

Recently I started seeing warnings about pages with refcount -1. The
problem was traced to packets being reused after their head was merged into
a GRO packet by skb_gro_receive(). While bisecting the issue pointed to
commit c21b48cc ("net: adjust skb->truesize in ___pskb_trim()") and
I have never seen it on a kernel with it reverted, I believe the real
problem appeared earlier when the option to merge head frag in GRO was
implemented.

Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
and head is merged (which in my case happens after the skb_condense()
call added by the commit mentioned above), the skb is reused including the
head that has been merged. As a result, we release the page reference
twice and eventually end up with negative page refcount.

To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
the same way it's done in napi_skb_finish().

Fixes: d7e8883c ("net: make GRO aware of skb->head_frag")
Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e44699d2

28 6月, 2017 1 次提交

net: prevent sign extension in dev_get_stats() · 6f64ec74

由 Eric Dumazet 提交于 6月 27, 2017

Similar to the fix provided by Dominik Heidler in commit
9b3dc0a1 ("l2tp: cast l2tp traffic counter to unsigned")
we need to take care of 32bit kernels in dev_get_stats().

When using atomic_long_read(), we add a 'long' to u64 and
might misinterpret high order bit, unless we cast to unsigned.

Fixes: caf586e5 ("net: add a core netdev->rx_dropped counter")
Fixes: 015f0688 ("net: net: add a core netdev->tx_dropped counter")
Fixes: 6e7333d3 ("net: add rx_nohandler stat counter")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6f64ec74

24 6月, 2017 3 次提交

xdp: add reporting of offload mode · ce158e58

由 Jakub Kicinski 提交于 6月 21, 2017

Extend the XDP_ATTACHED_* values to include offloaded mode.
Let drivers report whether program is installed in the driver
or the HW by changing the prog_attached field from bool to
u8 (type of the netlink attribute).

Exploit the fact that the value of XDP_ATTACHED_DRV is 1,
therefore since all drivers currently assign the mode with
double negation:
       mode = !!xdp_prog;
no drivers have to be modified.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ce158e58

xdp: add HW offload mode flag for installing programs · ee5d032f

由 Jakub Kicinski 提交于 6月 21, 2017

Add an installation-time flag for requesting that the program
be installed only if it can be offloaded to HW.

Internally new command for ndo_xdp is added, this way we avoid
putting checks into drivers since they all return -EINVAL on
an unknown command.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ee5d032f

xdp: pass XDP flags into install handlers · 32d60277

由 Jakub Kicinski 提交于 6月 21, 2017

Pass XDP flags to the xdp ndo.  This will allow drivers to look
at the mode flags and make decisions about offload.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

32d60277

21 6月, 2017 1 次提交

net/core: remove explicit do_softirq() from busy_poll_stop() · fe420d87

由 Sebastian Siewior 提交于 6月 16, 2017

Since commit 217f6974 ("net: busy-poll: allow preemption in
sk_busy_loop()") there is an explicit do_softirq() invocation after
local_bh_enable() has been invoked.
I don't understand why we need this because local_bh_enable() will
invoke do_softirq() once the softirq counter reached zero and we have
softirq-related work pending.
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fe420d87

18 6月, 2017 1 次提交

net: remove dst gc related code · 5b7c9a8f

由 Wei Wang 提交于 6月 17, 2017

This patch removes all dst gc related code and all the dst free
functions
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5b7c9a8f

16 6月, 2017 1 次提交

net: Add IFLA_XDP_PROG_ID · 58038695

由 Martin KaFai Lau 提交于 6月 15, 2017

Expose prog_id through IFLA_XDP_PROG_ID.  This patch
makes modification to generic_xdp.  The later patches will
modify other xdp-supported drivers.

prog_id is added to struct net_dev_xdp.

iproute2 patch will be followed. Here is how the 'ip link'
will look like:
> ip link show eth0
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp(prog_id:1) qdisc fq_codel state UP mode DEFAULT group default qlen 1000
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Acked-by: NAlexei Starovoitov <ast@fb.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

58038695

13 6月, 2017 1 次提交

net: rps: fix uninitialized symbol warning · 97d8b6e3

由 Ashwanth Goli 提交于 6月 13, 2017

This patch fixes uninitialized symbol warning that
got introduced by the following commit
773fc8f6 ("net: rps: send out pending IPI's on CPU hotplug")
Signed-off-by: NAshwanth Goli <ashwanth@codeaurora.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97d8b6e3

10 6月, 2017 1 次提交

net: rps: send out pending IPI's on CPU hotplug · 773fc8f6

由 ashwanth@codeaurora.org 提交于 6月 09, 2017

IPI's from the victim cpu are not handled in dev_cpu_callback.
So these pending IPI's would be sent to the remote cpu only when
NET_RX is scheduled on the victim cpu and since this trigger is
unpredictable it would result in packet latencies on the remote cpu.

This patch add support to send the pending ipi's of victim cpu.
Signed-off-by: NAshwanth Goli <ashwanth@codeaurora.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

773fc8f6

08 6月, 2017 2 次提交

net: Fix inconsistent teardown and release of private netdev state. · cf124db5

由 David S. Miller 提交于 5月 08, 2017

Network devices can allocate reasources and private memory using
netdev_ops->ndo_init().  However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cf124db5

net: don't call strlen on non-terminated string in dev_set_alias() · c28294b9

由 Alexander Potapenko 提交于 6月 06, 2017

KMSAN reported a use of uninitialized memory in dev_set_alias(),
which was caused by calling strlcpy() (which in turn called strlen())
on the user-supplied non-terminated string.
Signed-off-by: NAlexander Potapenko <glider@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c28294b9

07 6月, 2017 1 次提交

net: sched: introduce a TRAP control action · e25ea21f

由 Jiri Pirko 提交于 6月 06, 2017

There is need to instruct the HW offloaded path to push certain matched
packets to cpu/kernel for further analysis. So this patch introduces a
new TRAP control action to TC.

For kernel datapath, this action does not make much sense. So with the
same logic as in HW, new TRAP behaves similar to STOLEN. The skb is just
dropped in the datapath (and virtually ejected to an upper level, which
does not exist in case of kernel).
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NYotam Gigi <yotamg@mellanox.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e25ea21f

28 5月, 2017 1 次提交

rtnl: Add support for netdev event to link messages · 3d3ea5af

由 Vlad Yasevich 提交于 5月 27, 2017

When netdev events happen, a rtnetlink_event() handler will send
messages for every event in it's white list.  These messages contain
current information about a particular device, but they do not include
the iformation about which event just happened.  So, it is impossible
to tell what just happend for these events.

This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT
that would have an encoding of event that triggered this
message.  This would allow the the message consumer to easily determine
if it needs to perform certain actions.
Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
Acked-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3d3ea5af

22 5月, 2017 1 次提交

net: add function to retrieve original skb device using NAPI ID · 90b602f8

由 Miroslav Lichvar 提交于 5月 19, 2017

Since commit b6858177 ("net: Make skb->skb_iif always track
skb->dev") skbs don't have the original index of the interface which
received the packet. This information is now needed for a new control
message related to hardware timestamping.

Instead of adding a new field to skb, we can find the device by the NAPI
ID if it is available, i.e. CONFIG_NET_RX_BUSY_POLL is enabled and the
driver is using NAPI. Add dev_get_by_napi_id() and also skb_napi_id() to
hide the CONFIG_NET_RX_BUSY_POLL ifdef.

CC: Richard Cochran <richardcochran@gmail.com>
Suggested-by: NWillem de Bruijn <willemb@google.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NMiroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

90b602f8

20 5月, 2017 4 次提交

net: more accurate checksumming in validate_xmit_skb() · 43c26a1a

由 Davide Caratti 提交于 5月 18, 2017

skb_csum_hwoffload_help() uses netdev features and skb->csum_not_inet to
determine if skb needs software computation of Internet Checksum or crc32c
(or nothing, if this computation can be done by the hardware). Use it in
place of skb_checksum_help() in validate_xmit_skb() to avoid corruption
of non-GSO SCTP packets having skb->ip_summed equal to CHECKSUM_PARTIAL.

While at it, remove references to skb_csum_off_chk* functions, since they
are not present anymore in Linux  _ see commit cf53b1da ("Revert
 "net: Add driver helper functions to determine checksum offloadability"").
Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

43c26a1a

net: use skb->csum_not_inet to identify packets needing crc32c · dba00306

由 Davide Caratti 提交于 5月 18, 2017

skb->csum_not_inet carries the indication on which algorithm is needed to
compute checksum on skb in the transmit path, when skb->ip_summed is equal
to CHECKSUM_PARTIAL. If skb carries a SCTP packet and crc32c hasn't been
yet written in L4 header, skb->csum_not_inet is assigned to 1; otherwise,
assume Internet Checksum is needed and thus set skb->csum_not_inet to 0.
Suggested-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
Acked-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dba00306

sk_buff: remove support for csum_bad in sk_buff · 219f1d79

由 Davide Caratti 提交于 5月 18, 2017

This bit was introduced with commit 5a212329 ("net: Support for
csum_bad in skbuff") to reduce the stack workload when processing RX
packets carrying a wrong Internet Checksum. Up to now, only one driver and
GRO core are setting it.
Suggested-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

219f1d79

net: introduce skb_crc32c_csum_help · b72b5bf6

由 Davide Caratti 提交于 5月 18, 2017

skb_crc32c_csum_help is like skb_checksum_help, but it is designed for
checksumming SCTP packets using crc32c (see RFC3309), provided that
libcrc32c.ko has been loaded before. In case libcrc32c is not loaded,
invoking skb_crc32c_csum_help on a skb results in one the following
printouts:

warn_crc32c_csum_update: attempt to compute crc32c without libcrc32c.ko
warn_crc32c_csum_combine: attempt to compute crc32c without libcrc32c.ko
Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b72b5bf6

18 5月, 2017 1 次提交

net: sched: move tc_classify function to cls_api.c · 87d83093

由 Jiri Pirko 提交于 5月 17, 2017

Move tc_classify function to cls_api.c where it belongs, rename it to
fit the namespace.
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

87d83093

12 5月, 2017 2 次提交

xdp: refine xdp api with regards to generic xdp · d67b9cd2

由 Daniel Borkmann 提交于 5月 12, 2017

While working on the iproute2 generic XDP frontend, I noticed that
as of right now it's possible to have native *and* generic XDP
programs loaded both at the same time for the case when a driver
supports native XDP.

The intended model for generic XDP from b5cdae32 ("net: Generic
XDP") is, however, that only one out of the two can be present at
once which is also indicated as such in the XDP netlink dump part.
The main rationale for generic XDP is to ease accessibility (in
case a driver does not yet have XDP support) and to generically
provide a semantical model as an example for driver developers
wanting to add XDP support. The generic XDP option for an XDP
aware driver can still be useful for comparing and testing both
implementations.

However, it is not intended to have a second XDP processing stage
or layer with exactly the same functionality of the first native
stage. Only reason could be to have a partial fallback for future
XDP features that are not supported yet in the native implementation
and we probably also shouldn't strive for such fallback and instead
encourage native feature support in the first place. Given there's
currently no such fallback issue or use case, lets not go there yet
if we don't need to.

Therefore, change semantics for loading XDP and bail out if the
user tries to load a generic XDP program when a native one is
present and vice versa. Another alternative to bailing out would
be to handle the transition from one flavor to another gracefully,
but that would require to bring the device down, exchange both
types of programs, and bring it up again in order to avoid a tiny
window where a packet could hit both hooks. Given this complicates
the logic for just a debugging feature in the native case, I went
with the simpler variant.

For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae32
and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
or just a subset of flags that were used for loading the XDP prog
is suboptimal in the long run since not all flags are useful for
dumping and if we start to reuse the same flag definitions for
load and dump, then we'll waste bit space. What we really just
want is to dump the mode for now.

Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
a program is running at the native driver layer (1). Thus, add a
mode that says that a program is running at generic XDP layer (2).
Applications will handle this fine in that older binaries will
just indicate that something is attached at XDP layer, effectively
this is similar to IFLA_XDP_FLAGS attr that we would have had
modulo the redundancy.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d67b9cd2

xdp: add flag to enforce driver mode · 0489df9a

由 Daniel Borkmann 提交于 5月 12, 2017

After commit b5cdae32 ("net: Generic XDP") we automatically fall
back to a generic XDP variant if the driver does not support native
XDP. Allow for an option where the user can specify that always the
native XDP variant should be selected and in case it's not supported
by a driver, just bail out.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0489df9a

09 5月, 2017 2 次提交

treewide: convert PF_MEMALLOC manipulations to new helpers · f1083048

由 Vlastimil Babka 提交于 5月 08, 2017

We now have memalloc_noreclaim_{save,restore} helpers for robust setting
and clearing of PF_MEMALLOC.  Let's convert the code which was using the
generic tsk_restore_flags().  No functional change.

[vbabka@suse.cz: in net/core/sock.c the hunk is missing]
Link: http://lkml.kernel.org/r/20170405074700.29871-4-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Chris Leech <cleech@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Wouter Verhelst <w@uter.be>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f1083048

net: use kvmalloc with __GFP_REPEAT rather than open coded variant · da6bc57a

由 Michal Hocko 提交于 5月 08, 2017

fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc with
vmalloc fallback.  Use the kvmalloc variant instead.  Keep the
__GFP_REPEAT flag based on explanation from Eric:

 "At the time, tests on the hardware I had in my labs showed that
  vmalloc() could deliver pages spread all over the memory and that was
  a small penalty (once memory is fragmented enough, not at boot time)"

The way how the code is constructed means, however, that we prefer to go
and hit the OOM killer before we fall back to the vmalloc for requests
<=32kB (with 4kB pages) in the current code.  This is rather disruptive
for something that can be achived with the fallback.  On the other hand
__GFP_REPEAT doesn't have any useful semantic for these requests.  So
the effect of this patch is that requests which fit into 32kB will fall
back to vmalloc easier now.

Link: http://lkml.kernel.org/r/20170306103327.2766-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Eric Dumazet <edumazet@google.com>
Cc: David Miller <davem@davemloft.net>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

da6bc57a

02 5月, 2017 1 次提交

xdp: fix parameter kdoc for extack · b5d60989

由 Jakub Kicinski 提交于 5月 01, 2017

Fix kdoc parameter spelling from extact to extack.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b5d60989

01 5月, 2017 1 次提交

xdp: propagate extended ack to XDP setup · ddf9f970

由 Jakub Kicinski 提交于 4月 30, 2017

Drivers usually have a number of restrictions for running XDP
- most common being buffer sizes, LRO and number of rings.
Even though some drivers try to be helpful and print error
messages experience shows that users don't often consult
kernel logs on netlink errors.  Try to use the new extended
ack mechanism to carry the message back to user space.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ddf9f970

28 4月, 2017 1 次提交

net: remove unnecessary carrier status check · 0575c86b

由 Zhang Shengju 提交于 4月 26, 2017

Since netif_carrier_on() will do nothing if device's carrier is already
on, so it's unnecessary to do carrier status check.

It's the same for netif_carrier_off().
Signed-off-by: NZhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0575c86b

27 4月, 2017 1 次提交

net: core: Prevent from dereferencing null pointer when releasing SKB · 9899886d

由 Myungho Jung 提交于 4月 25, 2017

Added NULL check to make __dev_kfree_skb_irq consistent with kfree
family of functions.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=195289Signed-off-by: NMyungho Jung <mhjungk@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9899886d

26 4月, 2017 1 次提交

net: Generic XDP · b5cdae32

由 David S. Miller 提交于 4月 18, 2017

This provides a generic SKB based non-optimized XDP path which is used
if either the driver lacks a specific XDP implementation, or the user
requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.

It is arguable that perhaps I should have required something like
this as part of the initial XDP feature merge.

I believe this is critical for two reasons:

1) Accessibility.  More people can play with XDP with less
   dependencies.  Yes I know we have XDP support in virtio_net, but
   that just creates another depedency for learning how to use this
   facility.

   I wrote this to make life easier for the XDP newbies.

2) As a model for what the expected semantics are.  If there is a pure
   generic core implementation, it serves as a semantic example for
   driver folks adding XDP support.

One thing I have not tried to address here is the issue of
XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
whatever even if the XDP program doesn't try to push headers at all.
I think we really need the verifier to somehow propagate whether
certain XDP helpers are used or not.

v5:
 - Handle both negative and positive offset after running prog
 - Fix mac length in XDP_TX case (Alexei)
 - Use rcu_dereference_protected() in free_netdev (kbuild test robot)

v4:
 - Fix MAC header adjustmnet before calling prog (David Ahern)
 - Disable LRO when generic XDP is installed (Michael Chan)
 - Bypass qdisc et al. on XDP_TX and record the event (Alexei)
 - Do not perform generic XDP on reinjected packets (DaveM)

v3:
 - Make sure XDP program sees packet at MAC header, push back MAC
   header if we do XDP_TX.  (Alexei)
 - Elide GRO when generic XDP is in use.  (Alexei)
 - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
   XDP even if the driver has an XDP implementation.  (Alexei)
 - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
   attribute.  (Daniel)

v2:
 - Add some "fall through" comments in switch statements based
   upon feedback from Andrew Lunn
 - Use RCU for generic xdp_prog, thanks to Johannes Berg.
Tested-by: NAndy Gospodarek <andy@greyhouse.net>
Tested-by: NJesper Dangaard Brouer <brouer@redhat.com>
Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b5cdae32

22 4月, 2017 1 次提交

Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning · 7acf8a1e

由 Matthew Whitehead 提交于 4月 19, 2017

Constants used for tuning are generally a bad idea, especially as hardware
changes over time. Replace the constant 2 jiffies with sysctl variable
netdev_budget_usecs to enable sysadmins to tune the softirq processing.
Also document the variable.

For example, a very fast machine might tune this to 1000 microseconds,
while my regression testing 486DX-25 needs it to be 4000 microseconds on
a nearly idle network to prevent time_squeeze from being incremented.

Version 2: changed jiffies to microseconds for predictable units.
Signed-off-by: NMatthew Whitehead <tedheadster@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7acf8a1e

14 4月, 2017 1 次提交

net: Add a xfrm validate function to validate_xmit_skb · f6e27114

由 Steffen Klassert 提交于 4月 14, 2017

When we do IPsec offloading, we need a fallback for
packets that were targeted to be IPsec offloaded but
rerouted to a device that does not support IPsec offload.
For that we add a function that checks the offloading
features of the sending device and and flags the
requirement of a fallback before it calls the IPsec
output function. The IPsec output function adds the IPsec
trailer and does encryption if needed.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

f6e27114

12 4月, 2017 1 次提交

net: xdp: don't export dev_change_xdp_fd() · df7dd8fc

由 Johannes Berg 提交于 4月 12, 2017

Since dev_change_xdp_fd() is only used in rtnetlink, which must
be built-in, there's no reason to export dev_change_xdp_fd().
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

df7dd8fc

11 4月, 2017 1 次提交

sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags() · 717a94b5

由 NeilBrown 提交于 4月 07, 2017

It is not safe for one thread to modify the ->flags
of another thread as there is no locking that can protect
the update.

So tsk_restore_flags(), which takes a task pointer and modifies
the flags, is an invitation to do the wrong thing.

All current users pass "current" as the task, so no developers have
accepted that invitation.  It would be best to ensure it remains
that way.

So rename tsk_restore_flags() to current_restore_flags() and don't
pass in a task_struct pointer.  Always operate on current->flags.
Signed-off-by: NNeilBrown <neilb@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

717a94b5

10 4月, 2017 1 次提交

Revert "rtnl: Add support for netdev event to link messages" · bf74b20d

由 David S. Miller 提交于 4月 09, 2017

This reverts commit def12888.

As per discussion between Roopa Prabhu and David Ahern, it is
advisable that we instead have the code collect the setlink triggered
events into a bitmask emitted in the IFLA_EVENT netlink attribute.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bf74b20d

05 4月, 2017 1 次提交

rtnl: Add support for netdev event to link messages · def12888

由 Vlad Yasevich 提交于 4月 04, 2017

When netdev events happen, a rtnetlink_event() handler will send
messages for every event in it's white list.  These messages contain
current information about a particular device, but they do not include
the iformation about which event just happened.  The consumer of
the message has to try to infer this information.  In some cases
(ex: NETDEV_NOTIFY_PEERS), that is not possible.

This patch adds a new extension to RTM_NEWLINK message called IFLA_EVENT
that would have an encoding of the which event triggered this
message.  This would allow the the message consumer to easily determine
if it is interested in a particular event or not.
Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

def12888

25 3月, 2017 1 次提交

net: Commonize busy polling code to focus on napi_id instead of socket · 7db6b048

由 Sridhar Samudrala 提交于 3月 24, 2017

Move the core functionality in sk_busy_loop() to napi_busy_loop() and
make it independent of sk.

This enables re-using this function in epoll busy loop implementation.
Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7db6b048

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功