1. 18 May 2017 (12 commits)
  2. 17 May 2017 (4 commits)
    • tcp: internal implementation for pacing · 218af599
      Committed by Eric Dumazet
      BBR congestion control depends on pacing, and pacing is
      currently handled by the sch_fq packet scheduler, for performance
      reasons and also because implementing pacing with FQ was a
      convenient way to truly avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many Linux hosts are not focused on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or by use of the
      SO_MAX_PACING_RATE socket option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
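
      For illustration, a minimal userspace sketch of requesting a pacing
      rate via that socket option (the value is in bytes per second;
      set_pacing_rate() is a hypothetical helper, not part of the patch):

        #include <stdint.h>
        #include <sys/socket.h>

        /* Ask for ~48 Mbit/s (6 MB/s) of pacing on a TCP socket. If
         * sch_fq sits on the egress path it enforces the rate;
         * otherwise, after this patch, TCP paces internally. */
        static int set_pacing_rate(int fd)
        {
                uint32_t rate = 6 * 1000 * 1000; /* bytes per second */

                return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                                  &rate, sizeof(rate));
        }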
      
      One advantage of pacing from the TCP stack is more precise RTT
      estimations, and less work done at TX completion time, since TCP
      Small Queues limits are generally not hit. Setups with a single
      TX queue but many CPUs might even benefit from this.
      
      Note that unlike sch_fq, we do not take header sizes into account.
      Taking care of these headers would add additional complexity for
      no practical difference in behavior.
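
      A condensed kernel-side sketch of the internal pacing idea described
      above (hedged: helper name and details simplified from the patch;
      the delay for an skb is derived from its length and the pacing rate):

        /* sketch: pace an skb by len/rate when no fq qdisc owns pacing */
        static void tcp_internal_pacing_sketch(struct sock *sk,
                                               const struct sk_buff *skb)
        {
                u64 len_ns, rate = sk->sk_pacing_rate;

                if (!rate || rate == ~0U)
                        return;         /* pacing not requested */
                len_ns = (u64)skb->len * NSEC_PER_SEC;
                do_div(len_ns, rate);
                /* arm an hrtimer that resumes the xmit path later */
                hrtimer_start(&tcp_sk(sk)->pacing_timer,
                              ktime_add_ns(ktime_get(), len_ns),
                              HRTIMER_MODE_ABS_PINNED);
        }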
      
      Some performance numbers, using 800 TCP_STREAM flows rate limited
      to ~48 Mbit per second on a 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC:
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ:
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, the number of interrupts per second is very different.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: keep the sk_receive_queue held when splicing · 6dfb4367
      Committed by Paolo Abeni
      On packet reception, when we are forced to splice the
      sk_receive_queue, we can keep the related lock held, so that
      we can avoid re-acquiring it if forward memory scheduling is
      required.
      
      v1 -> v2:
        the rx_queue_lock_held param in udp_rmem_release() is
        now a bool
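
      A hedged sketch of the resulting shape (simplified; the deficit
      bookkeeping is elided, names follow the description above):

        static void udp_rmem_release(struct sock *sk, int size, int partial,
                                     bool rx_queue_lock_held)
        {
                struct sk_buff_head *queue = &sk->sk_receive_queue;

                /* The caller may already hold the queue lock because it
                 * just spliced the queue; skip a second lock round trip. */
                if (!rx_queue_lock_held)
                        spin_lock(&queue->lock);
                /* ... flush the forward-allocated memory deficit ... */
                if (!rx_queue_lock_held)
                        spin_unlock(&queue->lock);
        }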
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: use a separate rx queue for packet reception · 2276f58a
      Committed by Paolo Abeni
      Under UDP flood, the sk_receive_queue spinlock is heavily contended.
      This patch tries to reduce the contention on that lock by adding a
      second receive queue to UDP sockets; recvmsg() looks first
      in that queue and, only if it is empty, tries to fetch the data from
      sk_receive_queue. The latter is spliced into the newly added
      queue every time the receive path has to acquire the
      sk_receive_queue lock.
      
      The accounting of forward allocated memory is still protected with
      the sk_receive_queue lock, so udp_rmem_release() needs to acquire
      both locks when the forward deficit is flushed.
      
      In specific scenarios we can end up acquiring and releasing the
      sk_receive_queue lock multiple times; that will be addressed by
      the next patch.
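
      A sketch of the two-queue scheme (hedged: simplified, with the
      private queue passed in explicitly; only the reader consumes it,
      so its lock is nearly uncontended):

        static struct sk_buff *
        udp_next_skb_sketch(struct sock *sk,
                            struct sk_buff_head *reader_queue)
        {
                struct sk_buff *skb;

                /* fast path: the private queue */
                skb = __skb_dequeue(reader_queue);
                if (skb)
                        return skb;

                /* slow path: take the contended lock once and move
                 * every pending skb over in one shot */
                spin_lock_irq(&sk->sk_receive_queue.lock);
                skb_queue_splice_tail_init(&sk->sk_receive_queue,
                                           reader_queue);
                spin_unlock_irq(&sk->sk_receive_queue.lock);

                return __skb_dequeue(reader_queue);
        }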
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sock: factor out dequeue/peek with offset code · 65101aec
      Committed by Paolo Abeni
      And update __sk_queue_drop_skb() to work on the specified queue.
      This will help the udp protocol to use an additional private
      rx queue in a later patch.
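
      The shape of the refactoring, sketched (hedged: the real signature
      carries more parameters): the queue becomes an explicit argument
      instead of being hard-wired to sk->sk_receive_queue.

        /* before: always operated on &sk->sk_receive_queue */
        int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
                                unsigned int flags);

        /* after (sketch): the caller names the queue, so udp can pass
         * its private rx queue later on */
        int __sk_queue_drop_skb(struct sock *sk,
                                struct sk_buff_head *sk_queue,
                                struct sk_buff *skb, unsigned int flags);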
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 16 May 2017 (3 commits)
  4. 12 May 2017 (8 commits)
    • sctp: fix src address selection if using secondary addresses for ipv6 · dbc2b5e9
      Committed by Xin Long
      Commit 0ca50d12 ("sctp: fix src address selection if using secondary
      addresses") has fixed a src address selection issue when using secondary
      addresses for ipv4.
      
      Now sctp ipv6 has a similar issue. When using a secondary address,
      sctp_v6_get_dst tries to choose the saddr that shares the most
      leading bits with the daddr, via sctp_v6_addr_match_len. This can
      make some cases not work as expected.
      
      hostA:
        [1] fd21:356b:459a:cf10::11 (eth1)
        [2] fd21:356b:459a:cf20::11 (eth2)
      
      hostB:
        [a] fd21:356b:459a:cf30::2  (eth1)
        [b] fd21:356b:459a:cf40::2  (eth2)
      
      route from hostA to hostB:
        fd21:356b:459a:cf30::/64 dev eth1  metric 1024  mtu 1500
      
      The expected path should be:
        fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
      But addr[2] matches addr[a] in more bits than addr[1] does,
      according to sctp_v6_addr_match_len. It causes the path to be:
        fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2
      
      This patch fixes it in the same way as Marcelo's fix for sctp ipv4.
      As there is no ip_dev_find() for ipv6, it uses ipv6_chk_addr() to
      check whether the saddr is configured on a device instead.
      
      Note that for backwards compatibility, it will still do the
      addr_match_len check here when no optimal address is found.
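
      A sketch of the added constraint (hedged; simplified from the
      patch): only consider a candidate saddr that is actually configured
      on the route's output device.

        /* Prefer a source address that lives on the route's output dev.
         * With a non-zero last argument, ipv6_chk_addr() only returns
         * true if @saddr is configured on @dst->dev itself. */
        static bool saddr_on_dev_sketch(struct net *net,
                                        const struct in6_addr *saddr,
                                        struct dst_entry *dst)
        {
                return ipv6_chk_addr(net, saddr, dst->dev, 1);
        }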
      Reported-by: Patrick Talbert <ptalbert@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: make macro tipc_wait_for_cond() smp safe · 844cf763
      Committed by Jon Paul Maloy
      The macro tipc_wait_for_cond() embeds the macro sk_wait_event()
      to fulfil its task. The latter, in turn, evaluates the stated
      condition outside the socket lock context. This is problematic if
      the condition is accessing non-trivial data structures which may be
      altered by incoming interrupts, as is the case with the cong_links
      linked list, used by the socket to keep track of the current set of
      congested links. We sometimes see crashes when this list is accessed
      by a condition function at the same time as a SOCK_WAKEUP interrupt
      is removing an element from the list.
      
      We fix this by expanding selected parts of sk_wait_event() into the
      outer macro, while ensuring that all evaluations of a given condition
      are performed under socket lock protection.
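
      A condensed sketch of the resulting macro shape (hedged: simplified,
      error and signal handling elided). The condition is always evaluated
      with the socket lock held; the lock is dropped only around the sleep.

        #define tipc_wait_for_cond_sketch(sock_, timeo_, condition_)       \
        ({                                                                 \
                struct sock *sk_ = (sock_)->sk;                            \
                DEFINE_WAIT_FUNC(wait_, woken_wake_function);              \
                                                                           \
                add_wait_queue(sk_sleep(sk_), &wait_);                     \
                /* condition_ is checked under lock_sock(), so wakeup      \
                 * handlers cannot mutate its data underneath us */        \
                while (!(condition_)) {                                    \
                        release_sock(sk_);                                 \
                        *(timeo_) = wait_woken(&wait_, TASK_INTERRUPTIBLE, \
                                               *(timeo_));                 \
                        lock_sock(sk_);                                    \
                }                                                          \
                remove_wait_queue(sk_sleep(sk_), &wait_);                  \
        })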
      
      Fixes: 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: optimize class dumps · cb395b20
      Committed by Eric Dumazet
      In commit 59cc1f61 ("net: sched: convert qdisc linked list to
      hashtable") we missed the opportunity to considerably speed up
      tc_dump_tclass_root() when a qdisc handle is provided by the user.
      
      Instead of iterating all the qdiscs, use qdisc_match_from_root()
      to directly get the one we look for.
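
      The gist, sketched (hedged; simplified from the patch):

        /* before: iterate every qdisc hashed on the device */

        /* after: direct lookup when the user supplied a handle */
        if (tcm->tcm_parent) {
                struct Qdisc *q;

                q = qdisc_match_from_root(root, TC_H_MAJ(tcm->tcm_parent));
                if (q && tc_dump_tclass_qdisc(q, skb, tcm, cb, t_p, s_t) < 0)
                        return -1;
                return 0;
        }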
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: avoid fragmenting peculiar skbs in SACK · b451e5d2
      Committed by Yuchung Cheng
      This patch fixes a bug in splitting an SKB during SACK
      processing. Specifically, if an skb contains multiple
      packets and is only partially sacked in the higher sequences,
      tcp_match_skb_to_sack() splits the skb and marks the second
      fragment as SACKed.
      
      The current code further attempts rounding up the first fragment
      to MSS boundaries. But it misses a boundary condition when the
      rounded-up fragment size (pkt_len) is exactly the skb size.
      Splitting such an skb is pointless and causes a kernel warning,
      which aborts the SACK processing. This patch universally checks
      for such an over-split before calling tcp_fragment() to prevent
      these unnecessary warnings.
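
      The essence of the guard, sketched (hedged; from the description
      above, with the 2017-era tcp_fragment() signature):

        /* after rounding pkt_len up to an MSS boundary, never split
         * when the first fragment would already cover the whole skb */
        if (pkt_len >= skb->len && !in_sack)
                return 0;

        err = tcp_fragment(sk, skb, pkt_len, mss, GFP_ATOMIC);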
      
      Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netem: fix skb_orphan_partial() · f6ba8d33
      Committed by Eric Dumazet
      I should have known that lowering skb->truesize was dangerous :/
      
      In case packets are not leaving the host via a standard Ethernet
      device, but are looped back to local sockets, bad things can happen,
      as reported by Michael Madsen
      ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
      
      So instead of tweaking skb->truesize, let's change skb->destructor
      and keep a reference on the owner socket via its sk_refcnt.
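
      A hedged sketch of the new approach (simplified; the real patch
      also handles tcp_wfree and pure-ACK skbs):

        void skb_orphan_partial(struct sk_buff *skb)
        {
                if (skb->destructor == sock_wfree) {
                        struct sock *sk = skb->sk;

                        /* pin sk for sock_efree() instead of keeping the
                         * flow-control relevant sk_wmem_alloc charge */
                        if (atomic_inc_not_zero(&sk->sk_refcnt)) {
                                atomic_sub(skb->truesize,
                                           &sk->sk_wmem_alloc);
                                skb->destructor = sock_efree;
                        }
                } else {
                        skb_orphan(skb);
                }
        }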
      
      Fixes: f2f872f9 ("netem: Introduce skb_orphan_partial() helper")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Michael Madsen <mkm@nabto.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: refine xdp api with regards to generic xdp · d67b9cd2
      Committed by Daniel Borkmann
      While working on the iproute2 generic XDP frontend, I noticed that
      as of right now it's possible to have a native *and* a generic XDP
      program loaded at the same time when a driver supports native XDP.
      
      The intended model for generic XDP from b5cdae32 ("net: Generic
      XDP") is, however, that only one of the two can be present at
      once, which is also indicated as such in the XDP netlink dump part.
      The main rationale for generic XDP is to ease accessibility (in
      case a driver does not yet have XDP support) and to generically
      provide a semantic model as an example for driver developers
      wanting to add XDP support. The generic XDP option for an XDP
      aware driver can still be useful for comparing and testing both
      implementations.
      
      However, it is not intended to provide a second XDP processing
      stage or layer with exactly the same functionality as the first,
      native stage. The only reason would be to have a partial fallback
      for future XDP features that are not yet supported in the native
      implementation, and we probably shouldn't strive for such a
      fallback anyway, but rather encourage native feature support in
      the first place. Given there's currently no such fallback issue
      or use case, let's not go there yet if we don't need to.
      
      Therefore, change semantics for loading XDP and bail out if the
      user tries to load a generic XDP program when a native one is
      present and vice versa. Another alternative to bailing out would
      be to handle the transition from one flavor to another gracefully,
      but that would require to bring the device down, exchange both
      types of programs, and bring it up again in order to avoid a tiny
      window where a packet could hit both hooks. Given this complicates
      the logic for just a debugging feature in the native case, I went
      with the simpler variant.
      
      For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae32
      and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
      or just a subset of flags that were used for loading the XDP prog
      is suboptimal in the long run, since not all flags are useful for
      dumping, and if we start to reuse the same flag definitions for
      load and dump, then we'll waste bit space. What we really want
      for now is just to dump the mode.
      
      Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
      or a program is running at the native driver layer (1). Thus, add
      a mode that says that a program is running at the generic XDP
      layer (2). Applications will handle this fine, in that older
      binaries will just indicate that something is attached at the XDP
      layer; effectively this is similar to the IFLA_XDP_FLAGS attr that
      we would have had, modulo the redundancy.
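
      Sketched in uapi terms (hedged: xdp_attached_mode() is a
      hypothetical stand-in for the mode computation):

        /* tri-state reported under IFLA_XDP_ATTACHED */
        enum {
                XDP_ATTACHED_NONE = 0,  /* nothing installed */
                XDP_ATTACHED_DRV,       /* native, in the driver */
                XDP_ATTACHED_SKB,       /* generic, at the skb layer */
        };

        /* dump side: a single u8 attribute */
        if (nla_put_u8(skb, IFLA_XDP_ATTACHED, xdp_attached_mode(dev)))
                goto nla_put_failure;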
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: add flag to enforce driver mode · 0489df9a
      Committed by Daniel Borkmann
      After commit b5cdae32 ("net: Generic XDP") we automatically fall
      back to a generic XDP variant if the driver does not support native
      XDP. Allow for an option where the user can specify that the native
      XDP variant should always be selected, and in case it's not
      supported by a driver, just bail out.
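
      A sketch of the setup-path check (hedged: XDP_FLAGS_DRV_MODE is the
      uapi flag; the surrounding logic is simplified and the ndo name is
      era-specific):

        /* refuse to silently fall back to generic XDP */
        if ((flags & XDP_FLAGS_DRV_MODE) && !dev->netdev_ops->ndo_xdp)
                return -EOPNOTSUPP;     /* user demanded native XDP */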
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6/dccp: do not inherit ipv6_mc_list from parent · 83eaddab
      Committed by WANG Cong
      Like commit 657831ff ("dccp/tcp: do not inherit mc_list from parent")
      we should clear ipv6_mc_list etc. for IPv6 sockets too.
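
      The shape of the fix, sketched (hedged): the freshly cloned child
      socket must not share the parent's per-socket IPv6 lists.

        /* in the IPv6 syn_recv/accept path, after cloning the parent */
        newnp->ipv6_mc_list = NULL;
        newnp->ipv6_ac_list = NULL;
        newnp->ipv6_fl_list = NULL;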
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 10 May 2017 (1 commit)
  6. 09 May 2017 (12 commits)
    • DECnet: Use container_of() for embedded struct · f92ceb01
      Committed by Kees Cook
      Instead of a direct cross-type cast, use container_of() to locate
      the embedded structure, even in the face of future struct layout
      randomization.
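
      The pattern, sketched with a hypothetical embedded struct (names
      illustrative, not from the patch):

        struct dn_example {
                struct dst_entry dst;   /* embedded member */
                int extra;
        };

        /* before: only safe while 'dst' stays the first member */
        struct dn_example *p = (struct dn_example *)dst;

        /* after: correct regardless of member position or future
         * structure layout randomization */
        struct dn_example *q = container_of(dst, struct dn_example, dst);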
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "ipv4: restore rt->fi for reference counting" · 32f1bc0f
      Committed by David S. Miller
      This reverts commit 82486aa6.
      
      As implemented, this causes dangling netdevice refs.
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • treewide: convert PF_MEMALLOC manipulations to new helpers · f1083048
      Committed by Vlastimil Babka
      We now have memalloc_noreclaim_{save,restore} helpers for robust setting
      and clearing of PF_MEMALLOC.  Let's convert the code which was using the
      generic tsk_restore_flags().  No functional change.
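
      The helper pair in use, sketched (do_flush_sketch() is a
      hypothetical caller):

        #include <linux/sched/mm.h>

        static void do_flush_sketch(void)
        {
                unsigned int noreclaim_flag;

                /* was: current->flags |= PF_MEMALLOC; ...
                 *      tsk_restore_flags(current, pflags, PF_MEMALLOC); */
                noreclaim_flag = memalloc_noreclaim_save();
                /* allocations here may dip into reserves and will not
                 * recurse into direct reclaim */
                memalloc_noreclaim_restore(noreclaim_flag);
        }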
      
      [vbabka@suse.cz: in net/core/sock.c the hunk is missing]
      Link: http://lkml.kernel.org/r/20170405074700.29871-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Wouter Verhelst <w@uter.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: ceph: CURRENT_TIME with ktime_get_real_ts() · 1134e091
      Committed by Deepa Dinamani
      CURRENT_TIME is not y2038 safe.  The macro will be deleted and all
      the references to it will be replaced by ktime_get_* APIs.
      
      struct timespec is also not y2038 safe.  Retain timespec for timestamp
      representation here as ceph uses it internally everywhere.  These
      references will be changed to use struct timespec64 in a separate patch.
      
      The current_fs_time() api is being changed to use vfs struct inode* as
      an argument instead of struct super_block*.
      
      Set the new mds client request r_stamp field using ktime_get_real_ts()
      instead of using current_fs_time().
      
      Also, since r_stamp is used as mtime on the server, use timespec_trunc()
      to truncate the timestamp, using the right granularity from the
      superblock.
      
      This API will be transitioned to be y2038 safe along with vfs.
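
      The replacement pattern, sketched (hedged: req and sb stand in for
      the mds request and superblock mentioned above):

        struct timespec ts;

        /* was: req->r_stamp = current_fs_time(sb); */
        ktime_get_real_ts(&ts);
        req->r_stamp = timespec_trunc(ts, sb->s_time_gran);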
      
      Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
      Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      M:	Ilya Dryomov <idryomov@gmail.com>
      M:	"Yan, Zheng" <zyan@redhat.com>
      M:	Sage Weil <sage@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • format-security: move static strings to const · 06324664
      Committed by Kees Cook
      While examining output from trial builds with -Wformat-security enabled,
      many strings were found that should be defined as "const", or as a char
      array instead of char pointer.  This makes some static analysis easier,
      by producing fewer false positives.
      
      As these are all trivial changes, it seemed best to put them all in a
      single patch rather than chopping them up per maintainer.
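
      The typical per-site change, sketched (hedged: a generic example,
      not a specific hunk from the patch):

        /* before: a writable pointer to a mutable string; flagged when
         * later passed as a format string */
        static char *banner = "driver vX.Y initialized\n";

        /* after: a const char array, provably immutable */
        static const char banner[] = "driver vX.Y initialized\n";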
      
      Link: http://lkml.kernel.org/r/20170405214711.GA5711@beast
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Jes Sorensen <jes@trained-monkey.org>	[runner.c]
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Sean Paul <seanpaul@chromium.org>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
      Cc: Salil Mehta <salil.mehta@huawei.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Patrice Chotard <patrice.chotard@st.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Mugunthan V N <mugunthanvnm@ti.com>
      Cc: Felipe Balbi <felipe.balbi@linux.intel.com>
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Antonio Quartulli <a@unstable.cc>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Kejian Yan <yankejian@huawei.com>
      Cc: Daode Huang <huangdaode@hisilicon.com>
      Cc: Qianqian Xie <xieqianqian@huawei.com>
      Cc: Philippe Reynes <tremyfr@gmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Christian Gromm <christian.gromm@microchip.com>
      Cc: Andrey Shvetsov <andrey.shvetsov@k2l.de>
      Cc: Jason Litzinger <jlitzingerdev@gmail.com>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmalloc: use __GFP_HIGHMEM implicitly · 19809c2d
      Committed by Michal Hocko
      __vmalloc* allows users to provide gfp flags for the underlying
      allocation.  This API is quite popular
      
        $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
        77
      
      The only problem is that many people are not aware that they really
      want to pass __GFP_HIGHMEM along with the other flags, because there
      is really no reason to consume precious lowmem on CONFIG_HIGHMEM
      systems for pages which are mapped to the kernel vmalloc space.
      About half of the users don't use this flag, though.  This suggests
      that we have made the API unnecessarily complex.
      
      This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
      be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
      are simplified and drop the flag.
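
      The caller-side effect, sketched with the 2017-era three-argument
      __vmalloc():

        /* before */
        ptr = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);

        /* after: __GFP_HIGHMEM is implied for the backing pages */
        ptr = __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);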
      
      Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Cristopher Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net: use kvmalloc with __GFP_REPEAT rather than open coded variant · da6bc57a
      Committed by Michal Hocko
      fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc
      with vmalloc fallback.  Use the kvmalloc variant instead.  Keep the
      __GFP_REPEAT flag, based on this explanation from Eric:
      
       "At the time, tests on the hardware I had in my labs showed that
        vmalloc() could deliver pages spread all over the memory and that was
        a small penalty (once memory is fragmented enough, not at boot time)"
      
      The way the code is constructed means, however, that we prefer to go
      and hit the OOM killer before we fall back to vmalloc for requests
      <=32kB (with 4kB pages) in the current code.  This is rather
      disruptive for something that can be achieved with the fallback.
      On the other hand __GFP_REPEAT doesn't have any useful semantic for
      these requests.  So the effect of this patch is that requests which
      fit into 32kB will fall back to vmalloc more easily now.
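
      The fq conversion, sketched (hedged: condensed to the allocation
      call itself):

        static void *fq_alloc_node(size_t sz, int node)
        {
                /* was: kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT |
                 * __GFP_NOWARN, node) with a manual vmalloc_node()
                 * fallback */
                return kvmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT, node);
        }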
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Committed by Michal Hocko
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference from kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.
      E.g. allocation requests <= 32kB (with 4kB pages) basically never
      fail and invoke the OOM killer to satisfy the allocation.  This
      sounds too disruptive for something that has a reasonable fallback -
      the vmalloc.  On the other hand those requests might fall back to
      vmalloc even when the memory allocator would succeed after several
      more reclaim/compaction attempts.  There is no guarantee something
      like that happens though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
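
      The generic shape of these conversions, sketched:

        /* before: open-coded dual-path allocation */
        ptr = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
        if (!ptr)
                ptr = vzalloc(size);

        /* after: one helper with a more conservative fallback policy */
        ptr = kvzalloc(size, GFP_KERNEL);
        /* ... use ptr ... */
        kvfree(ptr);    /* works for both backing allocators */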
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net/ipv6/ila/ila_xlat.c: simplify a strange allocation pattern · 847f716f
      Committed by Michal Hocko
      alloc_ila_locks seems to have copy-pasted the alloc_bucket_locks
      allocation pattern, which is quite unusual.  The default allocation
      size is 320 * sizeof(spinlock_t), which is sub-page unless lockdep
      is enabled, and then the performance benefit is really questionable
      and not worth the subtle code, IMHO.  Also note that the context in
      which we call ila_init_net (modprobe or a task creating a net
      namespace) has to be properly configured.
      
      Let's just simplify the code and use the kvmalloc helper, which is
      a transparent way to use kmalloc with a vmalloc fallback.
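
      Sketch of the simplified allocation (hedged: sizing and error
      handling condensed; n and ilan come from the surrounding code):

        /* one call: kmalloc first, transparent vmalloc fallback */
        ilan->locks = kvmalloc(n * sizeof(spinlock_t), GFP_KERNEL);
        if (!ilan->locks)
                return -ENOMEM;
        while (n--)
                spin_lock_init(&ilan->locks[n]);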
      
      Link: http://lkml.kernel.org/r/20170306103032.2540-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf · 242d3a49
      Committed by WANG Cong
      For each netns (except init_net), we initialize its null entry
      in 3 places:
      
      1) The template itself, as we use kmemdup()
      2) Code around dst_init_metrics() in ip6_route_net_init()
      3) ip6_route_dev_notify(), which is supposed to initialize it after
         loopback registers
      
      Unfortunately the last one still happens in the wrong order: we
      expect to initialize net->ipv6.ip6_null_entry->rt6i_idev to
      net->loopback_dev's idev, so we have to do that after the idev is
      added to loopback.  However, this notifier has priority == 0, the
      same as ipv6_dev_notf, and ipv6_dev_notf is registered after
      ip6_route_dev_notifier, so it is actually called after
      ip6_route_dev_notifier, i.e. the idev is not yet set up when
      ip6_route_dev_notify() runs.  This is similar to commit 2f460933
      ("ipv6: initialize route null entry in addrconf_init()") which
      fixed init_net.
      
      Fix it by picking a smaller priority for ip6_route_dev_notifier.
      Also, we have to release the refcnt accordingly when unregistering
      loopback_dev, because device exit functions are called before
      subsys exit functions.
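
      Sketched with the notifier API (hedged: the priority value is
      illustrative of the intent, not necessarily the exact constant):

        static struct notifier_block ip6_route_dev_notifier = {
                .notifier_call = ip6_route_dev_notify,
                /* strictly lower than ipv6_dev_notf's priority (0), so
                 * this runs after addrconf sets up loopback's idev */
                .priority = -10,
        };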
      Acked-by: David Ahern <dsahern@gmail.com>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vti: check nla_put_* return value · 8ed508fd
      Committed by Hangbin Liu
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vlan: Keep NETIF_F_HW_CSUM similar to other software devices · 8403debe
      Committed by Vlad Yasevich
      Vlan devices, like all other software devices, enable the
      NETIF_F_HW_CSUM feature.  However, unlike all the other software
      devices, vlans will switch to using the IP|IPV6_CSUM features if
      the underlying device uses them.  In that situation, checksum
      offload features on the vlan device can't be controlled via
      ethtool.
      
      This patch makes vlans keep HW_CSUM feature if the underlying
      device supports checksum offloading.  This makes vlan devices
      behave like other software devices, and restores control to the
      user.
      
      A side effect is that some offload settings (typically UFO)
      may be enabled on the vlan device while being disabled on the HW.
      However, the GSO code will correctly process the packets.  This
      actually results in slightly better raw throughput.
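
      A hedged sketch of the ndo_fix_features idea (simplified; not the
      full feature computation):

        static netdev_features_t
        vlan_fix_features_sketch(struct net_device *dev,
                                 netdev_features_t features)
        {
                struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;

                /* if the lower device can checksum at all, advertise the
                 * generic NETIF_F_HW_CSUM so ethtool keeps control here */
                if (real_dev->vlan_features & NETIF_F_CSUM_MASK)
                        features |= NETIF_F_HW_CSUM;

                return features;
        }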
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>