1. 04 Aug, 2017 5 commits
    • sock: skb_copy_ubufs support for compound pages · 3ece7826
      Committed by Willem de Bruijn
      Refine skb_copy_ubufs to support compound pages. With upcoming TCP
      zerocopy sendmsg, such fragments may appear.
      
      The existing code replaces each page one for one. Splitting each
      compound page into an independent number of regular pages can result
      in exceeding the MAX_SKB_FRAGS limit if data is not exactly page aligned.
      
      Instead, fill all destination pages but the last to PAGE_SIZE.
      Split the existing alloc + copy loop into separate stages:
      1. compute bytelength and minimum number of pages to store this.
      2. allocate
      3. copy, filling each page except the last to PAGE_SIZE bytes
      4. update skb frag array
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
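The four stages listed in the commit message above can be illustrated in user space. This is only a sketch under stated assumptions, not the kernel implementation: `SKETCH_PAGE_SIZE`, `pages_needed()` and `pack_frags()` are hypothetical names, plain byte buffers stand in for pages and skb fragments, and allocation (stage 2) and the frag array update (stage 4) are left to the caller.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Stage 1: minimum number of pages needed to hold total_len bytes. */
static size_t pages_needed(size_t total_len)
{
    return (total_len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/*
 * Stage 3: copy nr_frags source fragments into the destination pages,
 * filling every page except the last to SKETCH_PAGE_SIZE bytes.
 * The caller must supply at least pages_needed(sum of frag_len) pages.
 * Returns the number of bytes stored in the last page used.
 */
static size_t pack_frags(unsigned char **dst_pages,
                         const unsigned char *const *frags,
                         const size_t *frag_len, size_t nr_frags)
{
    size_t page = 0, off = 0;   /* current write position */

    for (size_t i = 0; i < nr_frags; i++) {
        size_t done = 0;

        while (done < frag_len[i]) {
            size_t n = frag_len[i] - done;

            if (off == SKETCH_PAGE_SIZE) {  /* page full: move on */
                page++;
                off = 0;
            }
            if (n > SKETCH_PAGE_SIZE - off)
                n = SKETCH_PAGE_SIZE - off;
            memcpy(dst_pages[page] + off, frags[i] + done, n);
            off += n;
            done += n;
        }
    }
    return off;
}
```

Packing densely means the destination page count depends only on the total byte length, not on how many source fragments there were or how they were aligned, which is what keeps the result within a fixed fragment limit.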
    • sock: allocate skbs from optmem · 98ba0bd5
      Committed by Willem de Bruijn
      Add sock_omalloc and sock_ofree to be able to allocate control skbs,
      for instance for looping errors onto sk_error_queue.
      
      The transmit budget (sk_wmem_alloc) is involved in transmit skb
      shaping, most notably in TCP Small Queues. Using this budget for
      control packets would impact transmission.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fib_rules: Implement notification logic in core · 1b2a4440
      Committed by Ido Schimmel
      Unlike the routing tables, the FIB rules share a common core, so instead
      of replicating the same logic for each address family we can simply dump
      the rules and send notifications from the core itself.
      
      To protect the integrity of the dump, a rules-specific sequence counter
      is added for each address family and incremented whenever a rule is
      added or deleted (under RTNL).
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: Make the FIB notification chain generic · 04b1d4e5
      Committed by Ido Schimmel
      The FIB notification chain is currently solely used by IPv4 code.
      However, we're going to introduce IPv6 FIB offload support, which
      requires these notifications as well.
      
      As explained in commit c3852ef7 ("ipv4: fib: Replay events when
      registering FIB notifier"), upon registration to the chain, the callee
      receives a full dump of the FIB tables and rules by traversing all the
      net namespaces. The integrity of the dump is ensured by a per-namespace
      sequence counter that is incremented whenever a change to the tables or
      rules occurs.
      
      In order to allow more address families to use the chain, each family is
      expected to register its fib_notifier_ops in its pernet init. These
      operations allow the common code to read the family's sequence counter
      as well as dump its tables and rules in the given net namespace.
      
      Additionally, a 'family' parameter is added to sent notifications, so
      that listeners could distinguish between the different families.
      
      Implement the common code that allows listeners to register to the chain
      and for address families to register their fib_notifier_ops. Subsequent
      patches will implement these operations in IPv6.
      
      In the future, ipmr and ip6mr will be extended to provide these
      notifications as well.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
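The role of the sequence counter in protecting a dump can be sketched as a read, dump, re-read loop. This is a user-space illustration only: `struct family_ops`, `dump_consistent()` and the `toy_*` helpers are hypothetical names chosen for this sketch, and the retry limit is an assumption rather than something stated in the commit message.

```c
#include <stdbool.h>

/* Hypothetical per-family operations, loosely following the description. */
struct family_ops {
    unsigned int (*seq_read)(void *ctx);  /* current change counter */
    void (*dump)(void *ctx);              /* replay tables and rules */
};

/*
 * Replay the dump until the counter is unchanged across it, giving up
 * after max_tries attempts. Returns true if a consistent dump was made.
 */
static bool dump_consistent(const struct family_ops *ops, void *ctx,
                            int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        unsigned int before = ops->seq_read(ctx);

        ops->dump(ctx);
        if (ops->seq_read(ctx) == before)
            return true;    /* nothing changed while dumping */
    }
    return false;
}

/* Toy state: the first `racing` dumps are disturbed by a concurrent
 * change, modelled here as a counter increment during the dump. */
struct toy_fib {
    unsigned int seq;
    int racing;
    int dumps;
};

static unsigned int toy_seq_read(void *ctx)
{
    return ((struct toy_fib *)ctx)->seq;
}

static void toy_dump(void *ctx)
{
    struct toy_fib *f = ctx;

    f->dumps++;
    if (f->racing > 0) {
        f->racing--;
        f->seq++;
    }
}

static const struct family_ops toy_ops = { toy_seq_read, toy_dump };
```

With a toy table that is modified during its first two dumps, the loop needs a third pass before the counter is stable; a table that keeps changing exhausts the retry budget and the caller has to handle the failure.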
    • bpf: fix the printing of ifindex · eb48d682
      Committed by William Tu
      Save the ifindex before it gets zeroed so the invalid
      ifindex can be printed out.
      Signed-off-by: William Tu <u9012063@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 03 Aug, 2017 1 commit
  3. 02 Aug, 2017 3 commits
  4. 30 Jul, 2017 2 commits
    • net: ethtool: add support for forward error correction modes · 1a5f3da2
      Committed by Vidya Sagar Ravipati
      Forward Error Correction (FEC) modes, i.e. Base-R
      and Reed-Solomon modes, are introduced in 25G/40G/100G standards
      for providing good BER at high speeds. Various networking devices
      which support 25G/40G/100G provide the ability to manage supported FEC
      modes, and the lack of FEC encoding control and reporting today is a
      source of interoperability issues for many vendors.
      FEC capability as well as a specific FEC mode, i.e. Base-R
      or RS modes, can be requested or advertised through bits D44:47 of
      the base link codeword.
      
      This patch set intends to provide option under ethtool to manage
      and report FEC encoding settings for networking devices as per
      IEEE 802.3 bj, bm and by specs.
      
      set-fec/show-fec option(s) are designed to provide control and
      report the FEC encoding on the link.
      
      SET FEC option:
      root@tor: ethtool --set-fec  swp1 encoding [off | RS | BaseR | auto]
      
      Encoding: Types of encoding
      Off    :  Turning off any encoding
      RS     :  enforcing RS-FEC encoding on supported speeds
      BaseR  :  enforcing Base R encoding on supported speeds
      Auto   :  IEEE defaults for the speed/medium combination
      
      Here are a few examples of what we would expect if encoding=auto:
      - if autoneg is on, we expect FEC to be negotiated as on or off
        as long as the protocol supports it
      - if the hardware is capable of detecting the FEC encoding on its
        receiver, it will reconfigure its encoder to match
      - in the absence of the above, the configuration would be set to IEEE
        defaults.
      
      From our understanding, this is essentially what most hardware/driver
      combinations are doing today in the absence of a way for users to
      control the behavior.
      
      SHOW FEC option:
      root@tor: ethtool --show-fec  swp1
      FEC parameters for swp1:
      Active FEC encodings: RS
      Configured FEC encodings:  RS | BaseR
      
      ETHTOOL DEVNAME output modification:
      
      ethtool devname output:
      root@tor:~# ethtool swp1
      Settings for swp1:
      root@hpe-7712-03:~# ethtool swp18
      Settings for swp18:
          Supported ports: [ FIBRE ]
          Supported link modes:   40000baseCR4/Full
                                  40000baseSR4/Full
                                  40000baseLR4/Full
                                  100000baseSR4/Full
                                  100000baseCR4/Full
                                  100000baseLR4_ER4/Full
          Supported pause frame use: No
          Supports auto-negotiation: Yes
          Supported FEC modes: [RS | BaseR | None | Not reported]
          Advertised link modes:  Not reported
          Advertised pause frame use: No
          Advertised auto-negotiation: No
          Advertised FEC modes: [RS | BaseR | None | Not reported]
      <<<< One or more FEC modes
          Speed: 100000Mb/s
          Duplex: Full
          Port: FIBRE
          PHYAD: 106
          Transceiver: internal
          Auto-negotiation: off
          Link detected: yes
      
      This patch includes the following changes:
      a) New ETHTOOL_GFECPARAM/ETHTOOL_SFECPARAM API, handled by
        the new get_fecparam/set_fecparam callbacks, provides support
        for configuration of forward error correction modes.
      b) Link mode bits for FEC modes i.e. None (No FEC mode), RS, BaseR/FC
        are defined so that users can configure these FEC modes for supported
        and advertising fields as part of link autonegotiation.
      Signed-off-by: Vidya Sagar Ravipati <vidya.chowdary@gmail.com>
      Signed-off-by: Dustin Byford <dustin@cumulusnetworks.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
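One way to picture the set-fec/show-fec options is as a small bitmask of encodings: the keyword given to --set-fec selects a bit, and --show-fec renders a mask back into the "RS | BaseR" form shown above. The sketch below is illustrative only. The `FEC_MODE_*` values, `fec_parse()` and `fec_format()` are hypothetical names invented here, not the kernel's or ethtool's definitions, and matching is exact on the spellings from the synopsis.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mode bits for this sketch; not the kernel's definitions. */
enum fec_mode {
    FEC_MODE_AUTO  = 1 << 0,
    FEC_MODE_OFF   = 1 << 1,
    FEC_MODE_RS    = 1 << 2,
    FEC_MODE_BASER = 1 << 3,
};

static const struct {
    unsigned int bit;
    const char *name;
} fec_names[] = {
    { FEC_MODE_AUTO,  "auto"  },
    { FEC_MODE_OFF,   "off"   },
    { FEC_MODE_RS,    "RS"    },
    { FEC_MODE_BASER, "BaseR" },
};

#define FEC_NAME_COUNT (sizeof(fec_names) / sizeof(fec_names[0]))

/* Map one keyword from the --set-fec synopsis to a bit; 0 if unknown. */
static unsigned int fec_parse(const char *word)
{
    for (size_t i = 0; i < FEC_NAME_COUNT; i++)
        if (strcmp(word, fec_names[i].name) == 0)
            return fec_names[i].bit;
    return 0;
}

/* Render a mask as "RS | BaseR"; stops early if buf is too small. */
static void fec_format(unsigned int mask, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return;
    buf[0] = '\0';
    for (size_t i = 0; i < FEC_NAME_COUNT; i++) {
        int n;

        if (!(mask & fec_names[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " | " : "", fec_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return;     /* truncated: keep what fits */
        used += (size_t)n;
    }
}
```

A mask lets the "configured" field hold several encodings at once while the "active" field holds a single bit, matching the two lines of the show-fec output.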
    • net: check dev->addr_len for dev_set_mac_address() · 0254e0c6
      Committed by WANG Cong
      Historically, dev_ifsioc() uses struct sockaddr as the MAC
      address definition, which is why dev_set_mac_address()
      accepts a struct sockaddr pointer as input. But now we
      have various types of MAC addresses whose lengths
      are up to MAX_ADDR_LEN, longer than struct sockaddr,
      and saved in dev->addr_len.
      
      It is too late to fix dev_ifsioc() due to API
      compatibility, so just reject those larger than
      sizeof(struct sockaddr), otherwise we would read
      and use some random bytes from kernel stack.
      
      Fortunately, only a few IPv6 tunnel devices have addr_len
      larger than sizeof(struct sockaddr) and they don't support
      ndo_set_mac_addr(). But with the team driver in lb mode, they
      can still be enslaved to a team master, which then takes on the
      same MAC address length.
      
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
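The length check described above can be sketched in user space as follows. All names here (`struct sketch_dev`, `sketch_set_mac()`, the `SKETCH_*` constants) are hypothetical stand-ins, and the sizes 16 and 32 are assumed to mirror sizeof(struct sockaddr) and MAX_ADDR_LEN; this is not the kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SOCKADDR_SIZE 16  /* stand-in for sizeof(struct sockaddr) */
#define SKETCH_MAX_ADDR_LEN  32  /* stand-in for MAX_ADDR_LEN */

/* Hypothetical device: stores an address of addr_len bytes. */
struct sketch_dev {
    unsigned char addr[SKETCH_MAX_ADDR_LEN];
    size_t addr_len;
};

/*
 * Copy a new address from a caller buffer of buf_len bytes.
 * If the device expects more bytes than the buffer holds, refuse
 * instead of reading past the end of the buffer.
 */
static int sketch_set_mac(struct sketch_dev *dev,
                          const unsigned char *buf, size_t buf_len)
{
    if (dev->addr_len > buf_len)
        return -EINVAL;
    memcpy(dev->addr, buf, dev->addr_len);
    return 0;
}
```

An Ethernet-like device with a 6-byte address passes the check, while a device whose address length is 32 is rejected when the caller only has a 16-byte buffer, which is the over-read the commit guards against.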
  5. 27 Jul, 2017 1 commit
  6. 25 Jul, 2017 3 commits
  7. 21 Jul, 2017 1 commit
  8. 20 Jul, 2017 4 commits
  9. 19 Jul, 2017 1 commit
  10. 18 Jul, 2017 10 commits
  11. 17 Jul, 2017 1 commit
  12. 14 Jul, 2017 3 commits
  13. 13 Jul, 2017 2 commits
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Committed by Michal Hocko
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has always been
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator forever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick the background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit the
         more aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • datagram: fix kernel-doc comments · d3f6cd9e
      Committed by stephen hemminger
      An underscore in the kernel-doc comment section has special meaning
      and misuse generates errors.
      
      ./net/core/datagram.c:207: ERROR: Unknown target name: "msg".
      ./net/core/datagram.c:379: ERROR: Unknown target name: "msg".
      ./net/core/datagram.c:816: ERROR: Unknown target name: "t".
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 08 Jul, 2017 1 commit
    • bonding: avoid NETDEV_CHANGEMTU event when unregistering slave · f51048c3
      Committed by WANG Cong
      As Hongjun/Nicolas summarized in their original patch:
      
      "
      When a device changes from one netns to another, it's first unregistered,
      then the netns reference is updated and the dev is registered in the new
      netns. Thus, when a slave moves to another netns, it is first
      unregistered. This triggers a NETDEV_UNREGISTER event which is caught by
      the bonding driver. The driver calls bond_release(), which calls
      dev_set_mtu() and thus triggers NETDEV_CHANGEMTU (the device is still in
      the old netns).
      "
      
      This is a very special case. Because the device is being unregistered,
      no one should still care about the NETDEV_CHANGEMTU event triggered
      at this point, so we can avoid broadcasting this event on this path
      and avoid touching the inetdev_event()/addrconf_notify() path.
      
      This requires exporting __dev_set_mtu() to the bonding driver.
      Reported-by: Hongjun Li <hongjun.li@6wind.com>
      Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 05 Jul, 2017 1 commit
  16. 03 Jul, 2017 1 commit
    • net: avoid one splat in fib_nl_delrule() · 5361e209
      Committed by Eric Dumazet
      We need to use refcount_set() on a newly created rule to avoid the
      following error:
      
      [   64.601749] ------------[ cut here ]------------
      [   64.601757] WARNING: CPU: 0 PID: 6476 at lib/refcount.c:184 refcount_sub_and_test+0x75/0xa0
      [   64.601758] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
      [   64.601769] CPU: 0 PID: 6476 Comm: ip Tainted: G        W       4.12.0-smp-DEV #274
      [   64.601771] task: ffff8837bf482040 task.stack: ffff8837bdc08000
      [   64.601773] RIP: 0010:refcount_sub_and_test+0x75/0xa0
      [   64.601774] RSP: 0018:ffff8837bdc0f5c0 EFLAGS: 00010286
      [   64.601776] RAX: 0000000000000026 RBX: 0000000000000001 RCX: 0000000000000000
      [   64.601777] RDX: 0000000000000026 RSI: 0000000000000096 RDI: ffffed06f7b81eae
      [   64.601778] RBP: ffff8837bdc0f5d0 R08: 0000000000000004 R09: fffffbfff4a54c25
      [   64.601779] R10: 00000000cbc500e5 R11: ffffffffa52a6128 R12: ffff881febcf6f24
      [   64.601779] R13: ffff881fbf4eaf00 R14: ffff881febcf6f80 R15: ffff8837d7a4ed00
      [   64.601781] FS:  00007ff5a2f6b700(0000) GS:ffff881fff800000(0000) knlGS:0000000000000000
      [   64.601782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   64.601783] CR2: 00007ffcdc70d000 CR3: 0000001f9c91e000 CR4: 00000000001406f0
      [   64.601783] Call Trace:
      [   64.601786]  refcount_dec_and_test+0x11/0x20
      [   64.601790]  fib_nl_delrule+0xc39/0x1630
      [   64.601793]  ? is_bpf_text_address+0xe/0x20
      [   64.601795]  ? fib_nl_newrule+0x25e0/0x25e0
      [   64.601798]  ? depot_save_stack+0x133/0x470
      [   64.601801]  ? ns_capable+0x13/0x20
      [   64.601803]  ? __netlink_ns_capable+0xcc/0x100
      [   64.601806]  rtnetlink_rcv_msg+0x23a/0x6a0
      [   64.601808]  ? rtnl_newlink+0x1630/0x1630
      [   64.601811]  ? memset+0x31/0x40
      [   64.601813]  netlink_rcv_skb+0x2d7/0x440
      [   64.601815]  ? rtnl_newlink+0x1630/0x1630
      [   64.601816]  ? netlink_ack+0xaf0/0xaf0
      [   64.601818]  ? kasan_unpoison_shadow+0x35/0x50
      [   64.601820]  ? __kmalloc_node_track_caller+0x4c/0x70
      [   64.601821]  rtnetlink_rcv+0x28/0x30
      [   64.601823]  netlink_unicast+0x422/0x610
      [   64.601824]  ? netlink_attachskb+0x650/0x650
      [   64.601826]  netlink_sendmsg+0x7b7/0xb60
      [   64.601828]  ? netlink_unicast+0x610/0x610
      [   64.601830]  ? netlink_unicast+0x610/0x610
      [   64.601832]  sock_sendmsg+0xba/0xf0
      [   64.601834]  ___sys_sendmsg+0x6a9/0x8c0
      [   64.601835]  ? copy_msghdr_from_user+0x520/0x520
      [   64.601837]  ? __alloc_pages_nodemask+0x160/0x520
      [   64.601839]  ? memcg_write_event_control+0xd60/0xd60
      [   64.601841]  ? __alloc_pages_slowpath+0x1d50/0x1d50
      [   64.601843]  ? kasan_slab_free+0x71/0xc0
      [   64.601845]  ? mem_cgroup_commit_charge+0xb2/0x11d0
      [   64.601847]  ? lru_cache_add_active_or_unevictable+0x7d/0x1a0
      [   64.601849]  ? __handle_mm_fault+0x1af8/0x2810
      [   64.601851]  ? may_open_dev+0xc0/0xc0
      [   64.601852]  ? __pmd_alloc+0x2c0/0x2c0
      [   64.601853]  ? __fdget+0x13/0x20
      [   64.601855]  __sys_sendmsg+0xc6/0x150
      [   64.601856]  ? __sys_sendmsg+0xc6/0x150
      [   64.601857]  ? SyS_shutdown+0x170/0x170
      [   64.601859]  ? handle_mm_fault+0x28a/0x650
      [   64.601861]  SyS_sendmsg+0x12/0x20
      [   64.601863]  entry_SYSCALL_64_fastpath+0x13/0x94
      
      Fixes: 717d1e99 ("net: convert fib_rule.refcnt from atomic_t to refcount_t")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
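Why a newly created object needs its reference count set explicitly can be shown with a toy counter. This is a deliberately simplified, non-atomic model: `struct sketch_refcount` and the `sketch_refcount_*` helpers are hypothetical names that only mimic the idea of a checked counter which flags a decrement from zero instead of wrapping around. It is not the kernel's refcount_t implementation.

```c
#include <stdbool.h>

/* Toy, non-atomic reference counter with underflow detection. */
struct sketch_refcount {
    unsigned int refs;
    int warned;     /* set when a decrement from zero was attempted */
};

/* Give a freshly created object its initial reference count. */
static void sketch_refcount_set(struct sketch_refcount *r, unsigned int n)
{
    r->refs = n;
}

/*
 * Drop one reference; returns true when the count reaches zero and the
 * object may be freed. Dropping from zero is an underflow: flag it and
 * report false rather than wrapping around.
 */
static bool sketch_refcount_dec_and_test(struct sketch_refcount *r)
{
    if (r->refs == 0) {
        r->warned = 1;
        return false;
    }
    return --r->refs == 0;
}
```

An object whose counter was left at zero trips the underflow flag on its first release, whereas one initialised with `sketch_refcount_set(&r, 1)` is released cleanly; the warning in the trace above is the checked counter reporting the former situation.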