1. 16 December 2015, 3 commits
    • net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK · a188222b
      Committed by Tom Herbert
      The name NETIF_F_ALL_CSUM is a misnomer. It does not correspond to the
      set of features for offloading all checksums; it is a mask of the
      checksum-offload-related feature bits. It is incorrect to set both
      NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM at the same
      time in the features of a device.
      
      This patch:
        - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
          NETIF_F_ALL_CSUM is being used as a mask).
        - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
          use NETIF_F_HW_CSUM in the features list instead of NETIF_F_ALL_CSUM.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a188222b
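      As a hedged illustration of the mask-versus-feature distinction (the
      NETIF_F_* bits are real; the fix_features-style helper below is
      hypothetical):

        #include <linux/netdevice.h>

        /* NETIF_F_CSUM_MASK covers all checksum-offload bits; use it to
         * clear the whole group before advertising exactly one of them. */
        static netdev_features_t foo_fix_csum_features(netdev_features_t features)
        {
                if (features & NETIF_F_CSUM_MASK) {
                        features &= ~NETIF_F_CSUM_MASK;
                        features |= NETIF_F_HW_CSUM;
                }
                return features;
        }
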
    • sctp: Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC · 53692b1d
      Committed by Tom Herbert
      The SCTP checksum is really a CRC and is very different from the
      standard 1's complement checksum that serves as the checksum
      for IP protocols. This offload interface is also very different.
      Rename NETIF_F_SCTP_CSUM to NETIF_F_SCTP_CRC to highlight these
      differences. The term CSUM should be reserved in the stack to refer
      to the standard 1's complement IP checksum.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53692b1d
    • switchdev: Pass original device to port netdev driver · 6ff64f6f
      Committed by Ido Schimmel
      switchdev drivers need to know the netdev on which the switchdev op was
      invoked. For example, the STP state of a VLAN interface configured on top
      of a port can change while it is a member of a bridge. In this case, the
      underlying driver should only change the STP state of that particular
      VLAN and not of all the VLANs configured on the port.
      
      However, current switchdev infrastructure only passes the port netdev down
      to the driver. Solve that by passing the original device down to the
      driver as part of the required switchdev object / attribute.
      
      This doesn't entail any change in current switchdev drivers. It simply
      enables those supporting stacked devices to know the originating device
      and act accordingly.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ff64f6f
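      A hedged sketch of how a driver might consume the new field (orig_dev
      follows this patch's naming on struct switchdev_attr; the foo_* driver
      functions are hypothetical):

        #include <linux/if_vlan.h>
        #include <net/switchdev.h>

        static int foo_port_attr_set(struct net_device *dev,
                                     const struct switchdev_attr *attr)
        {
                /* dev is the port netdev switchdev resolved to; orig_dev is
                 * the device the op was invoked on, e.g. a VLAN upper. */
                if (is_vlan_dev(attr->orig_dev))
                        return foo_port_vlan_stp_set(dev,
                                        vlan_dev_vlan_id(attr->orig_dev),
                                        attr->u.stp_state);
                return foo_port_stp_set(dev, attr->u.stp_state);
        }
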
  2. 09 December 2015, 4 commits
    • sock, cgroup: add sock->sk_cgroup · bd1060a1
      Committed by Tejun Heo
      In cgroup v1, dealing with cgroup membership was difficult because the
      number of membership associations was unbounded.  As a result, cgroup v1
      grew several controllers whose primary purpose is either tagging
      membership or pulling in configuration knobs from other subsystems so
      that cgroup membership tests can be avoided.
      
      net_cls and net_prio controllers are examples of the latter.  They
      allow configuring network-specific attributes from cgroup side so that
      network subsystem can avoid testing cgroup membership; unfortunately,
      these are not only cumbersome but also problematic.
      
      Both net_cls and net_prio aren't properly hierarchical.  Both inherit
      configuration from the parent on creation but there's no interaction
      afterwards.  An ancestor doesn't restrict the behavior in its subtree
      in any way, and configuration changes aren't propagated downwards.
      Especially when combined with cgroup delegation, this is problematic
      because delegatees can mess up whatever network configuration is
      implemented at the system level.  net_prio would allow the delegatees
      to set any priority value regardless of CAP_NET_ADMIN, and net_cls
      the same for classid.
      
      While it is possible to solve these issues from controller side by
      implementing hierarchical allowable ranges in both controllers, it
      would involve quite a bit of complexity in the controllers and further
      obfuscate network configuration as it becomes even more difficult to
      tell what's actually being configured looking from the network side.
      While not much can be done for v1 at this point, as membership
      handling is sane on cgroup v2, it'd be better to make cgroup matching
      behave like other network matches and classifiers rather than
      introducing further complications.
      
      In preparation, this patch updates sock->sk_cgrp_data handling so that
      it points to the v2 cgroup that sock was created in until either
      net_prio or net_cls is used.  Once either of the two is used,
      sock->sk_cgrp_data reverts to its previous role of carrying prioidx
      and classid.  This is to avoid adding yet another cgroup related field
      to struct sock.
      
      As the mode switching can happen at most once per boot, the switching
      mechanism is aimed at lowering hot path overhead.  It may leak a
      finite, likely small, number of cgroup refs and report spurious
      prioidx or classid on switching; however, dynamic updates of prioidx
      and classid have always been racy and lossy - socks between creation
      and fd installation are never updated, config changes don't update
      existing sockets at all, and prioidx may index with dead and recycled
      cgroup IDs.  Non-critical inaccuracies from small race windows won't
      make any noticeable difference.
      
      This patch doesn't make use of the pointer yet.  The following patch
      will implement netfilter match for cgroup2 membership.
      
      v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
          cgroup specific field.
      
      v3: Add comments explaining why sock_data_prioidx() and
          sock_data_classid() use different fallback values.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
      CC: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd1060a1
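      A hedged reconstruction of the overloading this sets up (little-endian
      variant; the authoritative definition is in cgroup-defs.h, and the
      mode-flag details here are recalled, not quoted):

        struct sock_cgroup_data {
                union {
                        struct {
                                u8      is_data;  /* clear: 'val' is the v2 cgroup pointer */
                                u8      padding;
                                u16     prioidx;  /* valid once net_prio is in use */
                                u32     classid;  /* valid once net_cls is in use */
                        } __packed;
                        u64     val;              /* cgroup pointer while is_data is clear */
                };
        };
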
    • net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct · 2a56a1fe
      Committed by Tejun Heo
      Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
      ->sk_cgrp_prioidx and ->sk_classid are moved into it.  The struct
      and its accessors are defined in cgroup-defs.h.  This is to prepare
      for overloading the fields with a cgroup pointer.
      
      This patch mostly performs equivalent conversions, but the following
      are noteworthy.
      
      * Equality test before updating classid is removed from
        sock_update_classid().  This shouldn't make any noticeable
        difference and a similar test will be implemented on the helper side
        later.
      
      * sock_update_netprioidx() now takes struct sock_cgroup_data and can
        be moved to netprio_cgroup.h without causing include dependency
        loop.  Moved.
      
      * The dummy version of sock_update_netprioidx() is converted to a
        static inline function while at it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a56a1fe
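      A hedged sketch of the wrapper as introduced here, before any
      overloading (accessor names follow the sock_cgroup_* convention; the
      exact definitions live in cgroup-defs.h):

        struct sock_cgroup_data {
                u16     prioidx;
                u32     classid;
        };

        static inline u16 sock_cgroup_prioidx(struct sock_cgroup_data *skcd)
        {
                return skcd->prioidx;
        }

        static inline u32 sock_cgroup_classid(struct sock_cgroup_data *skcd)
        {
                return skcd->classid;
        }
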
    • netprio_cgroup: limit the maximum css->id to USHRT_MAX · 297dbde1
      Committed by Tejun Heo
      netprio builds a per-netdev contiguous priomap array which is indexed by
      css->id.  The array is allocated using kzalloc(), effectively limiting
      the maximum supported ID to somewhere in the thousands.  This patch caps
      the maximum supported css->id to USHRT_MAX, which should be way above
      what is actually usable.
      
      This allows reducing sock->sk_cgrp_prioidx to u16 from u32.  The freed
      up part will be used to overload the cgroup related fields.
      sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the
      two cgroup related fields are adjacent.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      CC: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      297dbde1
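      A minimal, hedged sketch of such a cap (the hook name and error code
      are assumptions; only the USHRT_MAX comparison is the point):

        #include <linux/kernel.h>       /* USHRT_MAX */

        static int foo_css_online(struct cgroup_subsys_state *css)
        {
                /* prioidx is now a u16; refuse css IDs that cannot fit. */
                if (css->id > USHRT_MAX)
                        return -ENOSPC;
                return 0;
        }
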
    • net: Fix inverted test in __skb_recv_datagram · 760a4322
      Committed by Rainer Weikusat
      As the kernel generally uses negated error numbers, *err needs to be
      compared with -EAGAIN (d'oh).
      Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
      Fixes: ea3793ee ("core: enable more fine-grained datagram reception control")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      760a4322
  3. 07 December 2015, 1 commit
    • core: enable more fine-grained datagram reception control · ea3793ee
      Committed by Rainer Weikusat
      The __skb_recv_datagram routine in core/datagram.c provides a general
      skb reception facility supposed to be utilized by protocol modules
      providing datagram sockets. It encompasses both the actual recvmsg code
      and a surrounding 'sleep until data is available' loop. This is
      inconvenient if a protocol module has to use additional locking in order
      to maintain some per-socket state the generic datagram socket code is
      unaware of (as the af_unix code does). The patch below moves the recvmsg
      code proper into a new __skb_try_recv_datagram routine which doesn't
      sleep, and renames wait_for_more_packets to
      __skb_wait_for_more_packets, both routines being exported interfaces. The
      original __skb_recv_datagram routine is reimplemented on top of these
      two functions such that its user-visible behaviour remains unchanged.
      Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea3793ee
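      A hedged sketch of how the split composes (signatures simplified; the
      real functions also carry peek/offset bookkeeping). Note the negated
      errno comparison that commit 760a4322 above restores:

        struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
                                            int *err)
        {
                struct sk_buff *skb, *last;
                long timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);

                do {
                        skb = __skb_try_recv_datagram(sk, flags, err, &last);
                        if (skb)
                                return skb;
                        if (*err != -EAGAIN)    /* kernel errors are negative */
                                break;
                } while (timeo &&
                         !__skb_wait_for_more_packets(sk, err, &timeo, last));

                return NULL;
        }
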
  4. 06 December 2015, 1 commit
  5. 04 December 2015, 5 commits
  6. 03 December 2015, 1 commit
  7. 02 December 2015, 2 commits
    • net: fix sock_wake_async() rcu protection · ceb5d58b
      Committed by Eric Dumazet
      Dmitry provided a syzkaller (http://github.com/google/syzkaller) program
      triggering a fault in sock_wake_async() when async IO is requested.
      
      Said program stressed af_unix sockets, but the issue is generic
      and should be addressed in the core networking stack.
      
      The problem is that by the time sock_wake_async() is called,
      we should not access the @flags field of 'struct socket',
      as the inode containing this socket might be freed without
      further notice, and without an RCU grace period.
      
      We already maintain an RCU protected structure, "struct socket_wq"
      so moving SOCKWQ_ASYNC_NOSPACE & SOCKWQ_ASYNC_WAITDATA into it
      is the safe route.
      
      It also reduces the number of cache lines needing dirtying, so it might
      provide a performance improvement anyway.
      
      In followup patches, we might move the remaining flags (SOCK_NOSPACE,
      SOCK_PASSCRED, SOCK_PASSSEC) to save 8 bytes and let 'struct socket'
      be mostly read, shared between cpus.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ceb5d58b
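      A hedged sketch of the safe calling pattern this enables (after this
      patch sock_wake_async() takes the RCU-protected struct socket_wq; the
      wrapper name foo_wake_async is illustrative):

        static inline void foo_wake_async(struct sock *sk, int how, int band)
        {
                /* Never touch sock->flags here: the inode holding the socket
                 * may already be freed.  sk->sk_wq is RCU-managed instead. */
                rcu_read_lock();
                sock_wake_async(rcu_dereference(sk->sk_wq), how, band);
                rcu_read_unlock();
        }
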
    • net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Committed by Eric Dumazet
      This patch is a cleanup to make the following patch easier to
      review.
      
      The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to (struct socket_wq)->flags,
      to benefit from RCU protection in sock_wake_async().
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int nr, struct sock *sk), are added so that the
      following patch can change their implementation.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9cd3e072
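      As introduced here the helpers are thin wrappers; a hedged
      reconstruction (the followup patch repoints them at socket_wq):

        static inline void sk_set_bit(int nr, struct sock *sk)
        {
                set_bit(nr, &sk->sk_socket->flags);
        }

        static inline void sk_clear_bit(int nr, struct sock *sk)
        {
                clear_bit(nr, &sk->sk_socket->flags);
        }
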
  8. 01 December 2015, 1 commit
  9. 24 November 2015, 1 commit
    • cgroups: Allow dynamically changing net_classid · 3b13758f
      Committed by Nina Schiff
      The classid of a process is changed either when a process is moved to
      or from a cgroup or when the net_cls.classid file is updated.
      Previously net_cls only supported propagating these changes to the
      cgroup's related sockets when a process was added to or removed from
      the cgroup. This meant it was necessary to remove and re-add all
      processes to a cgroup in order to update its classid. This change
      introduces support for doing this dynamically - i.e. when the value is
      changed in the net_cls.classid file, this will also trigger an update
      to the classid associated with all sockets controlled by the cgroup.
      This mimics the behaviour of other cgroup subsystems.
      net_prio circumvents this issue by storing an index into a table with
      each socket (and so any updates to the table don't require updating
      the value associated with the socket). net_cls, however, passes the
      socket the classid directly, and so this additional step is needed.
      Signed-off-by: Nina Schiff <ninasc@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b13758f
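      A hedged sketch of the per-socket update this implies (the callback
      shape follows the kernel's iterate_fd() convention; the helper name is
      reconstructed from memory of the commit):

        /* Invoked for each open file of a task in the cgroup; rewrite the
         * classid on files that are sockets. */
        static int update_classid_sock(const void *v, struct file *file,
                                       unsigned int n)
        {
                int err;
                struct socket *sock = sock_from_file(file, &err);

                if (sock)
                        sock->sk->sk_classid = (u32)(unsigned long)v;
                return 0;
        }
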
  10. 23 November 2015, 2 commits
    • net, scm: fix PaX detected msg_controllen overflow in scm_detach_fds · 6900317f
      Committed by Daniel Borkmann
      David and HacKurx reported the following/similar size overflow triggered
      in a grsecurity kernel, thanks to PaX's gcc size overflow plugin:
      
      (Already fixed in later grsecurity versions by Brad and PaX Team.)
      
      [ 1002.296137] PAX: size overflow detected in function scm_detach_fds net/core/scm.c:314
                     cicus.202_127 min, count: 4, decl: msg_controllen; num: 0; context: msghdr;
      [ 1002.296145] CPU: 0 PID: 3685 Comm: scm_rights_recv Not tainted 4.2.3-grsec+ #7
      [ 1002.296149] Hardware name: Apple Inc. MacBookAir5,1/Mac-66F35F19FE2A0D05, [...]
      [ 1002.296153]  ffffffff81c27366 0000000000000000 ffffffff81c27375 ffffc90007843aa8
      [ 1002.296162]  ffffffff818129ba 0000000000000000 ffffffff81c27366 ffffc90007843ad8
      [ 1002.296169]  ffffffff8121f838 fffffffffffffffc fffffffffffffffc ffffc90007843e60
      [ 1002.296176] Call Trace:
      [ 1002.296190]  [<ffffffff818129ba>] dump_stack+0x45/0x57
      [ 1002.296200]  [<ffffffff8121f838>] report_size_overflow+0x38/0x60
      [ 1002.296209]  [<ffffffff816a979e>] scm_detach_fds+0x2ce/0x300
      [ 1002.296220]  [<ffffffff81791899>] unix_stream_read_generic+0x609/0x930
      [ 1002.296228]  [<ffffffff81791c9f>] unix_stream_recvmsg+0x4f/0x60
      [ 1002.296236]  [<ffffffff8178dc00>] ? unix_set_peek_off+0x50/0x50
      [ 1002.296243]  [<ffffffff8168fac7>] sock_recvmsg+0x47/0x60
      [ 1002.296248]  [<ffffffff81691522>] ___sys_recvmsg+0xe2/0x1e0
      [ 1002.296257]  [<ffffffff81693496>] __sys_recvmsg+0x46/0x80
      [ 1002.296263]  [<ffffffff816934fc>] SyS_recvmsg+0x2c/0x40
      [ 1002.296271]  [<ffffffff8181a3ab>] entry_SYSCALL_64_fastpath+0x12/0x85
      
      Further investigation showed that this can happen when an *odd* number of
      fds are being passed over AF_UNIX sockets.
      
      In these cases CMSG_LEN(i * sizeof(int)) and CMSG_SPACE(i * sizeof(int)),
      where i is the number of successfully passed fds, differ by 4 bytes due
      to the extra CMSG_ALIGN() padding in CMSG_SPACE() to an 8 byte boundary
      on 64 bit. The padding is used to align subsequent cmsg headers in the
      control buffer.
      
      When the control buffer passed in from the receiver side *lacks* these 4
      bytes (e.g. due to buggy/wrong API usage), then msg->msg_controllen will
      overflow in scm_detach_fds():
      
        int cmlen = CMSG_LEN(i * sizeof(int));  <--- cmlen w/o tail-padding
        err = put_user(SOL_SOCKET, &cm->cmsg_level);
        if (!err)
          err = put_user(SCM_RIGHTS, &cm->cmsg_type);
        if (!err)
          err = put_user(cmlen, &cm->cmsg_len);
        if (!err) {
          cmlen = CMSG_SPACE(i * sizeof(int));  <--- cmlen w/ 4 byte extra tail-padding
          msg->msg_control += cmlen;
          msg->msg_controllen -= cmlen;         <--- iff no tail-padding space here ...
        }                                            ... wrap-around
      
      F.e. it will wrap to a length of 18446744073709551612 bytes in case the
      receiver passed in msg->msg_controllen of 20 bytes, and the sender
      properly transferred 1 fd to the receiver, so that its CMSG_LEN results
      in 20 bytes and CMSG_SPACE in 24 bytes.
      
      In case of MSG_CMSG_COMPAT (scm_detach_fds_compat()), I haven't seen an
      issue in my tests as alignment seems always on 4 byte boundary. Same
      should be in case of native 32 bit, where we end up with 4 byte boundaries
      as well.
      
      In practice, passing msg->msg_controllen of 20 to recvmsg() while receiving
      a single fd would mean that on successful return, msg->msg_controllen is
      being set by the kernel to 24 bytes instead, thus more than the input
      buffer advertised. It could f.e. become an issue if such an application later
      on zeroes or copies the control buffer based on the returned msg->msg_controllen
      elsewhere.
      
      The maximum number of fds we can send is bounded by a hard upper limit, SCM_MAX_FD (253).
      
      Going over the code, it seems like msg->msg_controllen is not being read
      after scm_detach_fds() in scm_recv() anymore by the kernel, good!
      
      Relevant recvmsg() handlers are unix_dgram_recvmsg() (unix_seqpacket_recvmsg())
      and unix_stream_recvmsg(). Both return back to their recvmsg() caller,
      and ___sys_recvmsg() places the updated length, that is, new msg_control -
      old msg_control pointer into msg->msg_controllen (hence the 24 bytes seen
      in the example).
      
      Long time ago, Wei Yongjun fixed something related in commit 1ac70e7a
      ("[NET]: Fix function put_cmsg() which may cause usr application memory
      overflow").
      
      RFC3542, section 20.2. says:
      
        The fields shown as "XX" are possible padding, between the cmsghdr
        structure and the data, and between the data and the next cmsghdr
        structure, if required by the implementation. While sending an
        application may or may not include padding at the end of last
        ancillary data in msg_controllen and implementations must accept both
        as valid. On receiving a portable application must provide space for
        padding at the end of the last ancillary data as implementations may
        copy out the padding at the end of the control message buffer and
        include it in the received msg_controllen. When recvmsg() is called
        if msg_controllen is too small for all the ancillary data items
        including any trailing padding after the last item an implementation
        may set MSG_CTRUNC.
      
      Since we haven't placed MSG_CTRUNC for quite a long time already, just
      do the same as in 1ac70e7a to avoid an overflow.
      
      Btw, even the man-page author got this wrong :/ See db939c9b26e9 ("cmsg.3: Fix
      error in SCM_RIGHTS code sample"). Some people must have copied this (?),
      thus it got triggered in the wild (reported several times during boot by
      David and HacKurx).
      
      No Fixes tag this time as pre 2002 (that is, pre history tree).
      Reported-by: David Sterba <dave@jikos.cz>
      Reported-by: HacKurx <hackurx@gmail.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6900317f
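      The fix itself is small; a hedged sketch of the capped advance,
      mirroring what put_cmsg() has done since 1ac70e7a:

        if (!err) {
                cmlen = CMSG_SPACE(i * sizeof(int));
                /* If the receiver's buffer lacks the trailing padding, do not
                 * advance past msg_controllen - this is what used to wrap. */
                if (msg->msg_controllen < cmlen)
                        cmlen = CMSG_LEN(i * sizeof(int));
                msg->msg_control += cmlen;
                msg->msg_controllen -= cmlen;
        }
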
    • net: IPv6 fib lookup tracepoint · b811580d
      Committed by David Ahern
      Add a tracepoint to show fib6 table lookups and their result.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b811580d
  11. 21 November 2015, 1 commit
  12. 19 November 2015, 9 commits
    • net: provide generic busy polling to all NAPI drivers · 93d05d4a
      Committed by Eric Dumazet
      NAPI drivers no longer need to observe a particular protocol
      to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y).
      
      napi_hash_add() and napi_hash_del() are automatically called
      from the core networking stack, respectively from
      netif_napi_add() and netif_napi_del().
      
      This patch depends on free_netdev() and netif_napi_del() being
      called from process context, which seems to be the norm.
      
      Drivers might still prefer to call napi_hash_del() on their
      own, since they might combine all the rcu grace periods into
      a single one, knowing their NAPI structures' lifetime, while the
      core networking stack has no idea of a possible combining.
      
      Once this patch proves not to bring serious regressions,
      we will clean up drivers to either remove napi_hash_del()
      or provide appropriate combining of rcu grace periods.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93d05d4a
    • net: napi_hash_del() returns a boolean status · 34cbe27e
      Committed by Eric Dumazet
      napi_hash_del() will soon be used from both drivers (if they want)
      and the core networking stack.
      
      Callers are responsible for ensuring an RCU grace period is respected
      before freeing the napi structure: napi_hash_del() can signal whether
      this RCU grace period is needed or not.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34cbe27e
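      A hedged sketch of the resulting caller pattern (struct foo_priv and
      the teardown helper are hypothetical; synchronize_net() stands in for
      whatever grace-period mechanism a driver prefers):

        static void foo_remove_napi(struct foo_priv *priv)
        {
                /* Only wait out an RCU grace period if the napi was
                 * actually hashed. */
                if (napi_hash_del(&priv->napi))
                        synchronize_net();
                netif_napi_del(&priv->napi);
        }
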
    • net: move napi_hash[] into read mostly section · 6180d9de
      Committed by Eric Dumazet
      We do not often add/delete a napi context.
      Moving napi_hash[] into the read-mostly section avoids potential false sharing.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6180d9de
    • net: add netif_tx_napi_add() · d64b5e85
      Committed by Eric Dumazet
      netif_tx_napi_add() is a variant of netif_napi_add().
      
      It should be used by drivers that use a napi structure
      to exclusively poll TX.
      
      We do not want this kind of napi added to napi_hash[] by the following
      patches, which add generic busy polling to all NAPI drivers.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d64b5e85
    • net: move skb_mark_napi_id() into core networking stack · 93f93a44
      Committed by Eric Dumazet
      We would like to automatically provide busy polling support
      to all NAPI drivers, without them having to implement anything.
      
      skb_mark_napi_id() can be called from napi_gro_receive() and
      napi_get_frags().
      
      A few drivers are still calling skb_mark_napi_id() because
      they use netif_receive_skb(). They should eventually call
      napi_gro_receive() instead. I will leave this to driver
      maintainers.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93f93a44
    • net: network drivers no longer need to implement ndo_busy_poll() · ce6aea93
      Committed by Eric Dumazet
      Instead of having to implement a complex ndo_busy_poll() method,
      drivers can simply rely on NAPI poll logic.
      
      Busy polling gains come mainly from the polling itself,
      not from the exact details of how we poll the device.
      
      ndo_busy_poll(), if implemented, can avoid touching
      napi state, but it adds extra synchronization between
      normal napi->poll() and the busy poll handler, slowing down
      the common path (non busy polling) with extra atomic operations.
      In practice few drivers ever got busy poll because of the complexity.
      
      We could go one step further, and make busy polling
      available for all NAPI drivers, but this would require
      that all netif_napi_del() calls are done in process context
      so that we can call synchronize_rcu().
      Full audit would be required.
      
      Before this is done, a driver still needs to call:
      
      - skb_mark_napi_id() for each skb provided to the stack.
      - napi_hash_add() and napi_hash_del() to allocate a napi_id per napi struct.
      - Make sure RCU grace period is respected after napi_hash_del() before
        memory containing napi structure is freed.
      
      A followup patch implements busy poll for the mlx5 driver as an example.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce6aea93
    • net: allow BH servicing in sk_busy_loop() · 2a028ecb
      Committed by Eric Dumazet
      Instead of blocking BHs for the whole of sk_busy_loop(), block them
      only around ->ndo_busy_poll() calls.
      
      This has many benefits.
      
      1) allow tunneled traffic to use busy poll as well as native traffic.
         Tunnel handlers usually call netif_rx() and depend on net_rx_action()
         being run (from the softirq handler)
      
      2) allow RFS/RPS to be used (sending IPIs to other cpus if needed)
      
      3) use the 'let's burn cpu cycles' budget to do useful work
         (like TX completions, timers, RCU callbacks...)
      
      4) reduce BH latencies, making busy poll a better citizen.
      
      Tested with a SIT tunnel:
      
      lpaa5:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    37373.93
      16384  87380
      
      Now enable busy poll on both hosts
      
      lpaa5:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa6:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    58314.77
      16384  87380
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2a028ecb
    • net: un-inline sk_busy_loop() · 02d62e86
      Committed by Eric Dumazet
      There is really little gain from inlining this big function.
      We'll soon make it even bigger in the following patches.
      
      This also means we no longer need to export napi_by_id().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02d62e86
    • net: better skb->sender_cpu and skb->napi_id cohabitation · 52bd2d62
      Committed by Eric Dumazet
      skb->sender_cpu and skb->napi_id share common storage,
      and we have had various bugs because of this.
      
      We had to call skb_sender_cpu_clear() in some places so as
      not to leave a prior skb->napi_id in place and fool netdev_pick_tx().
      
      As suggested by Alexei, we could split the space so that
      these errors cannot happen.
      
      With the 0 value reserved as the common (not initialized) value,
      let's reserve the [1 .. NR_CPUS] range for valid sender_cpu values,
      and [NR_CPUS+1 .. ~0U] for valid napi_id values.
      
      This will allow proper busy polling support over tunnels.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52bd2d62
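      A hedged sketch of the allocator side of the split (napi_gen_id and
      napi_by_id() are the names used in net/core/dev.c; locking is omitted):

        static u32 napi_gen_id;

        static void foo_napi_hash_add(struct napi_struct *napi)
        {
                /* 0 means "unset" and [1 .. NR_CPUS] belongs to sender_cpu,
                 * so napi_id values must start above NR_CPUS. */
                do {
                        if (unlikely(++napi_gen_id < NR_CPUS + 1))
                                napi_gen_id = NR_CPUS + 1;
                } while (napi_by_id(napi_gen_id));
                napi->napi_id = napi_gen_id;
        }
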
  13. 18 November 2015, 4 commits
  14. 17 November 2015, 3 commits
  15. 10 November 2015, 1 commit
  16. 07 November 2015, 1 commit
    • mm, page_alloc: distinguish between being unable to sleep, unwilling
      to sleep and avoiding waking kswapd · d0164adc
      Committed by Mel Gorman
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be
      referred to as the "atomic reserve".  __GFP_HIGH users get access to the
      first lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT, leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0164adc
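      The helper named above is essentially a one-line test; a hedged
      reconstruction:

        /* True if the caller may enter direct reclaim, i.e. may block.
         * This replaces the historical (and now misleading) check for
         * __GFP_WAIT. */
        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }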