1. 19 Nov, 2015 19 commits
    • net: provide generic busy polling to all NAPI drivers · 93d05d4a
      Committed by Eric Dumazet
      NAPI drivers no longer need to observe a particular protocol
      to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)
      
      napi_hash_add() and napi_hash_del() are automatically called
      from core networking stack, respectively from
      netif_napi_add() and netif_napi_del()
      
      This patch depends on free_netdev() and netif_napi_del() being
      called from process context, which seems to be the norm.
      
      Drivers might still prefer to call napi_hash_del() on their
      own, since they can combine all the RCU grace periods into
      a single one, knowing the lifetime of their NAPI structures, while
      the core networking stack cannot know that combining is possible.
      
      Once this patch proves not to bring serious regressions,
      we will clean up drivers to either remove napi_hash_del()
      or provide appropriate combining of RCU grace periods.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: napi_hash_del() returns a boolean status · 34cbe27e
      Committed by Eric Dumazet
      napi_hash_del() will soon be used both from drivers (if they want)
      and from the core networking stack.
      
      Callers are responsible for ensuring that an RCU grace period is respected
      before freeing the napi structure: napi_hash_del() signals whether
      this RCU grace period is needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
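The caller contract described in 34cbe27e can be sketched as a toy Python model: the delete operation reports whether the entry was actually hashed, and the caller waits for a grace period only in that case. `NapiHash`, `release`, and the `wait_for_grace_period` callback are hypothetical names chosen for illustration; this is a sketch of the idea under those assumptions, not the kernel implementation.

```python
class NapiHash:
    """Toy stand-in for napi_hash[]: maps an id to a NAPI-like dict."""

    def __init__(self):
        self._table = {}
        self._next_id = 1  # hypothetical id numbering, for illustration only

    def add(self, napi):
        """Register napi and hand it an id."""
        napi_id = self._next_id
        self._next_id += 1
        self._table[napi_id] = napi
        napi["id"] = napi_id
        return napi_id

    def delete(self, napi):
        """Return True if napi was hashed, meaning readers may still see it
        and the caller must wait for a grace period before freeing it."""
        return self._table.pop(napi.get("id"), None) is not None


def release(napi_hash, napi, wait_for_grace_period):
    """Unhash napi, wait only when the delete says it is needed, then free."""
    if napi_hash.delete(napi):
        wait_for_grace_period()
    napi["freed"] = True
```

A structure that was never hashed (for example a TX-only one) is released without any wait, which is the saving the boolean return value makes possible.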
    • net: move napi_hash[] into read mostly section · 6180d9de
      Committed by Eric Dumazet
      We do not often add/delete a napi context.
      Moving napi_hash[] into the read-mostly section avoids potential false sharing.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add netif_tx_napi_add() · d64b5e85
      Committed by Eric Dumazet
      netif_tx_napi_add() is a variant of netif_napi_add().
      
      It should be used by drivers that use a napi structure
      exclusively to poll TX.
      
      We do not want to add this kind of napi to napi_hash[] in the following
      patches, which add generic busy polling to all NAPI drivers.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: move skb_mark_napi_id() into core networking stack · 93f93a44
      Committed by Eric Dumazet
      We would like to automatically provide busy polling support
      to all NAPI drivers, without them having to implement anything.
      
      skb_mark_napi_id() can be called from napi_gro_receive() and
      napi_get_frags().
      
      A few drivers are still calling skb_mark_napi_id() because
      they use netif_receive_skb(). They should eventually call
      napi_gro_receive() instead. I will leave this to the driver
      maintainers.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: remove mlx4_en_low_latency_recv() · 868fdb06
      Committed by Eric Dumazet
      Busy polling can now be handled in the generic NAPI poll infrastructure.
      This removes complexity and fast path overhead:
      
      mlx4 used two spin_lock()/spin_unlock() pairs per napi->poll() call,
      in mlx4_en_cq_lock_napi()/mlx4_en_cq_unlock_napi().
      
      Tested:
      
      Without busy polling :
      
      lpaa23:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa23:~# ./netperf -H lpaa24 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    47330.78
      
      With busy polling :
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa23:~# ./netperf -H lpaa24 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    97643.55
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
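The two TCP_RR runs quoted in 868fdb06 can be compared with a little arithmetic. Assuming a single transaction in flight (the runs report "first burst 0"), the reciprocal of the transaction rate approximates the mean round-trip time. The function names below are illustrative only.

```python
def speedup(rate_before, rate_after):
    """Ratio of two transaction rates (transactions per second)."""
    return rate_after / rate_before


def mean_round_trip_us(rate):
    """Mean time per request/response pair in microseconds, assuming a
    single outstanding transaction as in the runs quoted above."""
    return 1_000_000 / rate


baseline = 47330.78    # transactions/s with busy_read = 0
busy_poll = 97643.55   # transactions/s with busy_read = 70

print(f"speedup: {speedup(baseline, busy_poll):.2f}x")
print(f"round trip: {mean_round_trip_us(baseline):.1f} us -> "
      f"{mean_round_trip_us(busy_poll):.1f} us")
```

On these figures busy polling roughly doubles the transaction rate, which corresponds to roughly halving the mean round-trip time.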
    • bnx2x: remove bnx2x_low_latency_recv() support · b59768c6
      Committed by Eric Dumazet
      Switch to native NAPI polling, as this reduces overhead and complexity.
      
      The normal path is faster, since one cmpxchg() is no longer required,
      and busy polling through NAPI polling has the same performance.
      
      Tested:
      lpk50:~# cat /proc/sys/net/core/busy_read
      70
      lpk50:~# nstat >/dev/null;./netperf -H lpk55 -t TCP_RR;nstat
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpk55.prod.google.com () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    40095.07
      16384  87380
      IpInReceives                    401062             0.0
      IpInDelivers                    401062             0.0
      IpOutRequests                   401079             0.0
      TcpActiveOpens                  7                  0.0
      TcpPassiveOpens                 3                  0.0
      TcpAttemptFails                 3                  0.0
      TcpEstabResets                  5                  0.0
      TcpInSegs                       401036             0.0
      TcpOutSegs                      401052             0.0
      TcpOutRsts                      38                 0.0
      UdpInDatagrams                  26                 0.0
      UdpOutDatagrams                 27                 0.0
      Ip6OutNoRoutes                  1                  0.0
      TcpExtDelayedACKs               1                  0.0
      TcpExtTCPPrequeued              98                 0.0
      TcpExtTCPDirectCopyFromPrequeue 98                 0.0
      TcpExtTCPHPHits                 4                  0.0
      TcpExtTCPHPHitsToUser           98                 0.0
      TcpExtTCPPureAcks               5                  0.0
      TcpExtTCPHPAcks                 101                0.0
      TcpExtTCPAbortOnData            6                  0.0
      TcpExtBusyPollRxPackets         400832             0.0
      TcpExtTCPOrigDataSent           400983             0.0
      IpExtInOctets                   21273867           0.0
      IpExtOutOctets                  21261254           0.0
      IpExtInNoECTPkts                401064             0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx5: support napi_complete_done() · 44fb6fbb
      Committed by Eric Dumazet
      A NAPI poll handler should return the number of RX packets processed,
      instead of 0 / budget.
      
      This allows proper busy poll accounting through LINUX_MIB_BUSYPOLLRXPACKETS
      SNMP counter.
      
      napi_complete_done() allows /sys/class/net/ethX/gro_flush_timeout
      to be used for finer GRO aggregation control.
      
      Tested:
      
      Enabled busy polling, and checked TcpExtBusyPollRxPackets counter is increasing.
      
      echo 70 >/proc/sys/net/core/busy_read
      nstat >/dev/null
      netperf -H target -t TCP_RR >/dev/null
      nstat | grep TcpExtBusyPollRxPackets
      TcpExtBusyPollRxPackets         490958             0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Eli Cohen <eli@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
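Why the return value matters for the counter mentioned in 44fb6fbb can be shown with a toy Python model: a packet counter fed by the handler's return value is only accurate when the handler reports the work it actually did. `poll_reporting_work`, `poll_reporting_zero_or_budget`, and `busy_poll_rx_packets` are hypothetical names, and the real accounting in the kernel is not reproduced here.

```python
def poll_reporting_work(queue, budget):
    """Process up to `budget` packets and return the number processed."""
    done = min(len(queue), budget)
    del queue[:done]
    return done


def poll_reporting_zero_or_budget(queue, budget):
    """Older style: return `budget` if the budget was used up, else 0."""
    done = min(len(queue), budget)
    del queue[:done]
    return budget if done == budget else 0


def busy_poll_rx_packets(poll, queue, budget, rounds):
    """Sum the handler's return values, as a packet counter would."""
    return sum(poll(queue, budget) for _ in range(rounds))
```

With ten queued packets and a budget of four, the first handler accounts for all ten packets over three rounds, while the second undercounts because its final partial round reports zero.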
    • mlx5: add busy polling support · 7ae92ae5
      Committed by Eric Dumazet
      It is now easy to add busy polling support to a NAPI driver,
      with very little impact on normal input path.
      
      This patch serves as a reference implementation.
      
      Note:
      
      A followup patch will add proper napi_complete_done() in mlx5,
      so that LINUX_MIB_BUSYPOLLRXPACKETS snmp counter is properly handled.
      
      Tested:
      
      Normal TCP_RR results without busy polling :
      
      lpk51:~# echo 0 >/proc/sys/net/core/busy_read
      lpk52:~# echo 0 >/proc/sys/net/core/busy_read
      
      lpk51:~# ./netperf -H 192.168.4.52 -t TCP_RR -l 10
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.52 () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    53509.49
      16384  87380
      
      Now enable busy polling :
      
      lpk51:~# echo 70 >/proc/sys/net/core/busy_read
      lpk52:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpk51:~# ./netperf -H 192.168.4.52 -t TCP_RR -l 10
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.52 () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    97530.92
      16384  87380
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: network drivers no longer need to implement ndo_busy_poll() · ce6aea93
      Committed by Eric Dumazet
      Instead of having to implement a complex ndo_busy_poll() method,
      drivers can simply rely on NAPI poll logic.
      
      Busy polling gains come mainly from polling itself,
      not from the exact details of how we poll the device.
      
      ndo_busy_poll(), if implemented, can avoid touching
      napi state, but it adds extra synchronization between
      normal napi->poll() and the busy poll handler, slowing down
      the common path (non busy polling) with extra atomic operations.
      In practice few drivers ever got busy poll because of the complexity.
      
      We could go one step further, and make busy polling
      available for all NAPI drivers, but this would require
      that all netif_napi_del() calls are done in process context
      so that we can call synchronize_rcu().
      A full audit would be required.
      
      Before this is done, a driver still needs to call :
      
      - skb_mark_napi_id() for each skb provided to the stack.
      - napi_hash_add() and napi_hash_del() to allocate a napi_id per napi struct.
      - Make sure RCU grace period is respected after napi_hash_del() before
        memory containing napi structure is freed.
      
      Followup patch implements busy poll for mlx5 driver as an example.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: allow BH servicing in sk_busy_loop() · 2a028ecb
      Committed by Eric Dumazet
      Instead of blocking BH in the whole of sk_busy_loop(), block them
      only around ->ndo_busy_poll() calls.
      
      This has many benefits.
      
      1) allow tunneled traffic to use busy poll as well as native traffic.
         Tunnel handlers usually call netif_rx() and depend on net_rx_action()
         being run (from the softirq handler)
      
      2) allow RFS/RPS to be used (sending IPIs to other cpus if needed)
      
      3) use the 'lets burn cpu cycles' budget to do useful work
         (like TX completions, timers, RCU callbacks...)
      
      4) reduce BH latencies, making busy poll a better citizen.
      
      Tested:
      
      Tested with SIT tunnel
      
      lpaa5:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    37373.93
      16384  87380
      
      Now enable busy poll on both hosts
      
      lpaa5:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa6:~# echo 70 >/proc/sys/net/core/busy_read
      
      lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    58314.77
      16384  87380
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: un-inline sk_busy_loop() · 02d62e86
      Committed by Eric Dumazet
      There is really little gain from inlining this big function.
      We'll soon make it even bigger in the following patches.
      
      This means we no longer need to export napi_by_id().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx4: mlx4_en_low_latency_recv() called with BH disabled · 5865316c
      Committed by Eric Dumazet
      mlx4_en_low_latency_recv() is called with BH disabled,
      like other ndo_busy_poll() methods.
      
      There is no need for spin_lock_bh()/spin_unlock_bh().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: better skb->sender_cpu and skb->napi_id cohabitation · 52bd2d62
      Committed by Eric Dumazet
      skb->sender_cpu and skb->napi_id share common storage,
      and we have had various bugs because of this.
      
      We had to call skb_sender_cpu_clear() in some places so as
      not to leave a prior skb->napi_id behind and fool netdev_pick_tx().
      
      As suggested by Alexei, we can split the value space so that
      these errors cannot happen.
      
      With the 0 value reserved as the common (not initialized) value,
      let's reserve the [1 .. NR_CPUS] range for valid sender_cpu,
      and [NR_CPUS+1 .. ~0U] for valid napi_id.
      
      This will allow proper busy polling support over tunnels.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
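The range split described in 52bd2d62 can be illustrated with a short Python sketch. It classifies a value read from the shared field using only the three ranges stated in the commit message; `classify` and `UINT_MAX` are hypothetical names, and treating `~0U` as a 32-bit value is an assumption.

```python
UINT_MAX = 0xFFFFFFFF  # assumed stand-in for ~0U on a 32-bit unsigned int


def classify(value, nr_cpus):
    """Classify a value from the shared sender_cpu/napi_id storage.

    0                     -> not initialized
    1 .. nr_cpus          -> valid sender_cpu
    nr_cpus + 1 .. ~0U    -> valid napi_id
    """
    if not 0 <= value <= UINT_MAX:
        raise ValueError("value does not fit in an unsigned 32-bit field")
    if value == 0:
        return "unset"
    if value <= nr_cpus:
        return "sender_cpu"
    return "napi_id"
```

Because the ranges do not overlap, a stale napi_id can no longer be mistaken for a sender_cpu, which is the class of bug the commit message describes.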
    • be2net: remove local variable 'status' · d37b4c0a
      Committed by Ivan Vecera
      lancer_cmd_get_file_len() uses lancer_cmd_read_object() to get
      the current size of the registers for the ethtool register dump. The returned
      status value is stored but not checked. The check itself is not necessary, as
      the data_read output variable is initialized to 0, so the status variable
      can be removed.
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hisilicon: fix binding document of mdio · 6fbaa570
      Committed by huangdaode
      This patch explains when "hisilcon,mdio" and
      "hisilicon,hns-mdio" apply, according to Arnd's comments,
      and reformats the document according to comments from Rob <robh@kernel.org>.
      Signed-off-by: huangdaode <huangdaode@hisilicon.com>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net ipv4: use preferred log methods · 09605cc1
      Committed by Bastian Stender
      Replace printk calls with preferred unconditional log method calls to keep
      kernel messages clean.
      
      Add a newline to the "too small MTU" message.
      Signed-off-by: Bastian Stender <bst@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · 34258a32
      Committed by Linus Torvalds
      Pull s390 fixes from Martin Schwidefsky:
       "Assorted bug fixes, the mlock2 system call gets added, and one
        improvement.  The boot from dasd devices is now possible from a wider
        range of devices"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
        s390: remove SALIPL loader
        s390: wire up mlock2 system call
        s390: remove g5 elf platform support
        s390: avoid cache aliasing under z/VM and KVM
        s390/sclp: _sclp_wait_int(): retain full PSW mask
        s390/zcrypt: Fix initialisation when zcrypt is built-in
        s390/zcrypt: Fix kernel crash on systems without AP bus support
        s390: add support for ipl devices in subchannel sets > 0
        s390/ipl: fix out of bounds access in scpdata_write
        s390/pci_dma: improve debugging of errors during dma map
        s390/pci_dma: handle dma table failures
        s390/pci_dma: unify label of invalid translation table entries
        s390/syscalls: remove system call number calculation
        s390/cio: simplify css_generate_pgid
        s390/diag: add a s390 prefix to the diagnose trace point
        s390/head: fix error message on unsupported hardware
    • Merge tag 'hwmon-for-linus-v4.4-rc2' of... · 0d77a123
      Committed by Linus Torvalds
      Merge tag 'hwmon-for-linus-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
      
      Pull hwmon fixes from Guenter Roeck:
       "Fix build issues in scpi and ina2xx drivers, update scpi driver to
        support recent firmware, and fix an uninitialized variable warning in
        applesmc driver"
      
      * tag 'hwmon-for-linus-v4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
        hwmon: (scpi) skip unsupported sensors properly
        hwmon: (scpi) add thermal-of dependency
        hwmon : (applesmc) Fix uninitialized variables warnings
        hwmon: (ina2xx) Fix build issue by selecting REGMAP_I2C
  2. 18 Nov, 2015 20 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 7f151f1d
      Committed by Linus Torvalds
      Pull networking fixes from David Miller:
      
       1) Fix list tests in netfilter ingress support, from Florian Westphal.
      
       2) Fix reversal of input and output interfaces in ingress hook
          invocation, from Pablo Neira Ayuso.
      
       3) We have a use after free in r8169, caught by Dave Jones, fixed by
          Francois Romieu.
      
       4) Splice use-after-free fix in AF_UNIX from Hannes Frederic Sowa.
      
       5) Three ipv6 route handling bug fixes from Martin KaFai Lau:
          a) Don't create clone routes not managed by the fib6 tree
          b) Don't forget to check expiration of DST_NOCACHE routes.
          c) Handle rt->dst.from == NULL properly.
      
       6) Several AF_PACKET fixes wrt transport header setting and SKB
          protocol setting, from Daniel Borkmann.
      
       7) Fix thunder driver crash on shutdown, from Pavel Fedin.
      
       8) Several Mellanox driver fixes (max MTU calculations, use of correct
          DMA unmap in TX path, etc.) from Saeed Mahameed, Tariq Toukan, Doron
          Tsur, Achiad Shochat, Eran Ben Elisha, and Noa Osherovich.
      
       9) Several mv88e6060 DSA driver fixes (wrong bit definitions for
          certain registers, etc.) from Neil Armstrong.
      
      10) Make sure to disable preemption while updating per-cpu stats of ip
          tunnels, from Jason A.  Donenfeld.
      
      11) Various ARM64 bpf JIT fixes, from Yang Shi.
      
      12) Flush icache properly in ARM JITs, from Daniel Borkmann.
      
      13) Fix masking of RX and TX interrupts in ravb driver, from Masaru
          Nagai.
      
      14) Fix netdev feature propagation for devices not implementing
          ->ndo_set_features().  From Nikolay Aleksandrov.
      
      15) Big endian fix in vmxnet3 driver, from Shrikrishna Khare.
      
      16) RAW socket code increments incorrect SNMP counters, fix from Ben
          Cartwright-Cox.
      
      17) IPv6 multicast SNMP counters are bumped twice, fix from Neil Horman.
      
      18) Fix handling of VLAN headers on stacked devices when REORDER is
          disabled.  From Vlad Yasevich.
      
      19) Fix SKB leaks and use-after-free in ipvlan and macvlan drivers, from
          Sabrina Dubroca.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (83 commits)
        MAINTAINERS: Update Mellanox's Eth NIC driver entries
        net/core: revert "net: fix __netdev_update_features return.." and add comment
        af_unix: take receive queue lock while appending new skb
        rtnetlink: fix frame size warning in rtnl_fill_ifinfo
        net: use skb_clone to avoid alloc_pages failure.
        packet: Use PAGE_ALIGNED macro
        packet: Don't check frames_per_block against negative values
        net: phy: Use interrupts when available in NOLINK state
        phy: marvell: Add support for 88E1540 PHY
        arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS
        macvlan: fix leak in macvlan_handle_frame
        ipvlan: fix use after free of skb
        ipvlan: fix leak in ipvlan_rcv_frame
        vlan: Do not put vlan headers back on bridge and macvlan ports
        vlan: Fix untag operations of stacked vlans with REORDER_HEADER off
        via-velocity: unconditionally drop frames with bad l2 length
        ipg: Remove ipg driver
        dl2k: Add support for IP1000A-based cards
        snmp: Remove duplicate OUTMCAST stat increment
        net: thunder: Check for driver data in nicvf_remove()
        ...
    • MAINTAINERS: Update Mellanox's Eth NIC driver entries · e7523a49
      Committed by Or Gerlitz
      Eugenia (Jenny) Emantayev is replacing Amir Vadai as the
      mlx4 Ethernet driver maintainer.
      
      Saeed Mahameed is assigned to maintain mlx5 Eth functionality.
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/core: revert "net: fix __netdev_update_features return.." and add comment · 17b85d29
      Committed by Nikolay Aleksandrov
      This reverts commit 00ee5927 ("net: fix __netdev_update_features return
      on ndo_set_features failure")
      and adds a comment explaining why it's okay to return a value other than
      0 upon error. Some drivers might actually change flags and return an
      error, so it's better to fire a spurious notification than to miss
      these changes.
      
      CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_unix: take receive queue lock while appending new skb · a3a116e0
      Committed by Hannes Frederic Sowa
      While we may not necessarily need to use sk_buff_head.lock in the
      future, removing it is a rather large change, as it affects the
      af_unix fd garbage collector, diag and socket cleanups. This is too much
      for a stable patch.
      
      For the time being, grab sk_buff_head.lock without disabling bh and irqs,
      so don't use the locked skb_queue_tail.
      
      Fixes: 869e7c62 ("net: af_unix: implement stream sendpage support")
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: fix frame size warning in rtnl_fill_ifinfo · b22b941b
      Committed by Hannes Frederic Sowa
      Fix the following warning:
      
        CC      net/core/rtnetlink.o
      net/core/rtnetlink.c: In function ‘rtnl_fill_ifinfo’:
      net/core/rtnetlink.c:1308:1: warning: the frame size of 2864 bytes is larger than 2048 bytes [-Wframe-larger-than=]
       }
       ^
      by splitting the huge rtnl_fill_ifinfo into several smaller functions, so that
      the large frame allocations do not all happen at the same time.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: use skb_clone to avoid alloc_pages failure. · 19125c1a
      Committed by Martin Zhang
      1. The new skb only needs the dst and the IP address (v4 or v6).
      2. skb_copy may need high-order pages, which are very rare on a long-running server.
      Signed-off-by: Junwei Zhang <linggao.zjw@alibaba-inc.com>
      Signed-off-by: Martin Zhang <martinbj2008@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: Use PAGE_ALIGNED macro · 90836b67
      Committed by Tobias Klauser
      Use PAGE_ALIGNED(...) instead of open-coding it.
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
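What a page-alignment check computes can be sketched in Python. The commit message does not show the expression that was replaced, so both forms below are illustrations rather than the code from the patch; `PAGE_SIZE = 4096` is an assumed value, and the function names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size; the real value depends on the architecture


def page_aligned_modulo(addr, page_size=PAGE_SIZE):
    """Alignment test written as a remainder check."""
    return addr % page_size == 0


def page_aligned_mask(addr, page_size=PAGE_SIZE):
    """Alignment test written as a mask check.

    Valid only when page_size is a power of two, so that page_size - 1
    has exactly the low-order bits set.
    """
    return (addr & (page_size - 1)) == 0
```

Both functions agree for every non-negative address when the page size is a power of two. A named helper states the intent once instead of repeating the bit manipulation at each call site.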
    • packet: Don't check frames_per_block against negative values · 4194b491
      Committed by Tobias Klauser
      rb->frames_per_block is an unsigned int, and thus can never be negative.
      
      Also fix spacing in the calculation of frames_per_block.
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: Use interrupts when available in NOLINK state · 321beec5
      Committed by Andrew Lunn
      The NOLINK state will poll the phy once a second to see if the link
      has come up. If the phy has an interrupt line, this polling can be
      skipped, since the phy should interrupt when the link returns.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • phy: marvell: Add support for 88E1540 PHY · 819ec8e1
      Committed by Andrew Lunn
      The 88E1540 can be found embedded in the Marvell 88E6352 switch.  It
      is compatible with the 88E1510, so add support for it, using the
      88E1510 specific functions.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • arm64: bpf: make BPF prologue and epilogue align with ARM64 AAPCS · ec0738db
      Committed by Yang Shi
      Save and restore FP/LR in the BPF prog prologue and epilogue, and save SP to FP
      in the prologue in order to get a correct stack backtrace.
      
      However, the ARM64 JIT used FP (x29) as the eBPF fp register. FP is subject to
      change during a function call, so the BPF prog stack base address may
      change too.
      
      Use x25 to replace FP as the BPF stack base register (fp). Since x25 is a
      callee-saved register, it stays intact across function calls.
      It is initialized in the BPF prog prologue every time the BPF prog starts
      to run. Save and restore x25/x26 in the BPF prologue and epilogue to keep
      them intact for code outside of BPF. Actually, x26 is unnecessary, but SP
      requires 16-byte alignment.
      
      So, the BPF stack layout looks like:
      
                                       high
               original A64_SP =>   0:+-----+ BPF prologue
                                      |FP/LR|
               current A64_FP =>  -16:+-----+
                                      | ... | callee saved registers
                                      +-----+
                                      |     | x25/x26
               BPF fp register => -80:+-----+
                                      |     |
                                      | ... | BPF prog stack
                                      |     |
                                      |     |
               current A64_SP =>      +-----+
                                      |     |
                                      | ... | Function call stack
                                      |     |
                                      +-----+
                                        low
      
      CC: Zi Shen Lim <zlim.lnx@gmail.com>
      CC: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Acked-by: Zi Shen Lim <zlim.lnx@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • macvlan: fix leak in macvlan_handle_frame · e639b8d8
      Committed by Sabrina Dubroca
      Reset pskb in macvlan_handle_frame in case skb_share_check returned a
      clone.
      
      Fixes: 8a4eb573 ("net: introduce rx_handler results and logic around that")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipvlan: fix use after free of skb · a534dc52
      Committed by Sabrina Dubroca
      ipvlan_handle_frame is an rx_handler, and when it returns a value other
      than RX_HANDLER_CONSUMED (here, NET_RX_DROP aka RX_HANDLER_ANOTHER),
      __netif_receive_skb_core expects that the skb still exists and will
      process it further, but we just freed it.
      
      Fixes: 2ad7bf36 ("ipvlan: Initial check-in of the IPVLAN driver.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
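The ownership rule described in a534dc52 can be modeled in a few lines of Python. The `Skb` class, the two handlers, and `receive` are hypothetical stand-ins; only the result names are borrowed from the commit message, and the fixed handler shows one way to honor the contract rather than the exact change made by the patch.

```python
RX_HANDLER_CONSUMED = "RX_HANDLER_CONSUMED"
RX_HANDLER_ANOTHER = "RX_HANDLER_ANOTHER"


class Skb:
    """Toy buffer that detects any access after it has been freed."""

    def __init__(self):
        self.freed = False

    def touch(self):
        if self.freed:
            raise RuntimeError("use after free")


def buggy_handler(skb):
    skb.freed = True            # frees the buffer...
    return RX_HANDLER_ANOTHER   # ...but tells the caller it still exists


def fixed_handler(skb):
    skb.freed = True
    return RX_HANDLER_CONSUMED  # the caller must not touch the buffer again


def receive(skb, handler):
    """Caller side: keep processing the skb unless it was consumed."""
    result = handler(skb)
    if result != RX_HANDLER_CONSUMED:
        skb.touch()
    return result
```

Calling `receive(Skb(), buggy_handler)` raises the simulated use-after-free, while the fixed handler returns cleanly because the caller stops at the consumed result.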
    • ipvlan: fix leak in ipvlan_rcv_frame · cf554ada
      Committed by Sabrina Dubroca
      Pass a **skb to ipvlan_rcv_frame so that if skb_share_check returns a
      new skb, we actually use it during further processing.
      
      It's safe to ignore the new skb in the ipvlan_xmit_* functions, because
      they call ipvlan_rcv_frame with local == true, so that dev_forward_skb
      is called and always takes ownership of the skb.
      
      Fixes: 2ad7bf36 ("ipvlan: Initial check-in of the IPVLAN driver.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'vlan-reorder' · eb3f8b42
      Committed by David S. Miller
      Vladislav Yasevich says:
      
      ====================
      Fix issues with vlans without REORDER_HEADER
      
      A while ago Phil Sutter brought up an issue with vlans without
      REORDER_HEADER and bridges.  The problem was that if a vlan
      without REORDER_HEADER was a port in the bridge, the bridge ended
      up forwarding corrupted packets that still contained the vlan header.
      The same issue exists for bridge mode macvlan/macvtap devices.
      
      An additional issue with vlans without REORDER_HEADER is that stacking
      them also doesn't work.  The reason here is that skb_reorder_vlan_header()
      function assumes that it is only ETH_HLEN bytes deep into the packet.  That
      is not the case when you have a vlan without the REORDER_HEADER flag set.
      
      This series attempts to correct these 2 issues.
      
      1) To solve the stacked vlans problem, the first patch simply uses
      skb->mac_len as an offset to start copying the mac addresses as
      part of header reordering.
      
      2) To fix the issue with bridge/macvlan/macvtap, the second patch
      simply doesn't write the vlan header back to the packet if the
      vlan device is either a bridge or a macvlan port.  This ends up
      being the simplest and least performance intrusive solution.
      
      I've considered extending patch 2 to all stacked devices (essentially
      checking for the presence of rx_handler), but that feels like a broader
      restriction and _may_ break existing uses.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb3f8b42
    • V
      vlan: Do not put vlan headers back on bridge and macvlan ports · 28f9ee22
      Vlad Yasevich committed
      When a vlan is configured with REORDER_HEADER set to 0, the vlan
      header is put back into the packet and makes it appear that
      the vlan header is still there even after it's been processed.
      This poses a problem for bridge and macvlan ports.  The packets
      passed to those devices may be forwarded and, at the time of the
      forward, vlan headers end up being unexpectedly present.
      
      With the patch, we make sure that we do not put the vlan header
      back (when REORDER_HEADER is 0) if a bridge or macvlan has
      been configured on top of the vlan device.
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      28f9ee22
    • V
      vlan: Fix untag operations of stacked vlans with REORDER_HEADER off · a6e18ff1
      Vlad Yasevich committed
      When we have multiple stacked vlan devices, all of which have
      the REORDER_HEADER flag turned off, the untag operation does not
      locate the ethernet addresses correctly for nested vlans.
      The reason is that when the REORDER_HEADER flag is off,
      the outer vlan headers are put back and the mac_len is adjusted
      to account for the presence of the header.  Then, the subsequent
      untag operation, for the next level vlan, always uses VLAN_ETH_HLEN
      to locate the beginning of the ethernet header, and that ends up
      being a multiple of 4 bytes short of the actual beginning
      of the mac header (the multiple depending on how many vlan
      encapsulations there are).
      
      As a result, if there are multiple levels of vlan devices
      with REORDER_HEADER being off, the received packets end up
      being dropped.
      
      To solve this, we use skb->mac_len as the offset.  The value
      is always set on the receive path and starts out as ETH_HLEN.
      The value is also updated when the vlan header manipulations occur
      so we know it will be correct.
      Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6e18ff1
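The offset arithmetic in this commit can be illustrated with a small userspace sketch. It is not the kernel's untag code: `mac_start_from_mac_len()` and `finds_mac_addrs()` are hypothetical helpers, the frame bytes are made up, and the constants are redefined locally with their usual values. The model assumes the data pointer sits just past the tag being stripped, so the MAC header starts `mac_len + VLAN_HLEN` bytes behind it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN      6   /* bytes in one MAC address */
#define ETH_HLEN      14  /* destination + source + ethertype */
#define VLAN_HLEN     4   /* one 802.1Q tag */
#define VLAN_ETH_HLEN 18  /* ETH_HLEN + VLAN_HLEN */

/* Distance from the data pointer (just past the tag being stripped)
 * back to the start of the MAC header, given the current mac_len. */
static size_t mac_start_from_mac_len(size_t mac_len)
{
    return mac_len + VLAN_HLEN;
}

/* Hypothetical check: build a double-tagged frame whose outer tag was
 * put back, then test whether looking `back` bytes behind the data
 * pointer finds the real destination and source addresses. */
static int finds_mac_addrs(size_t back)
{
    unsigned char frame[32] = {
        0xa0, 0xa1, 0xa2, 0xa3, 0xa4, 0xa5,   /* destination MAC */
        0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5,   /* source MAC */
        0x81, 0x00, 0x00, 0x0a,               /* outer tag, put back */
        0x81, 0x00, 0x00, 0x14,               /* inner tag being stripped */
        0x08, 0x00,                           /* encapsulated ethertype */
    };
    const unsigned char *data = frame + 22;   /* payload starts here */

    assert(back <= 22);
    return memcmp(data - back, frame, 2 * ETH_ALEN) == 0;
}
```

With one outer tag put back, mac_len is `ETH_HLEN + VLAN_HLEN`, so `mac_start_from_mac_len()` yields 22 and the addresses are found, while the fixed `VLAN_ETH_HLEN` lands 4 bytes short. With no outer tag, both approaches give 18.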
    • T
      via-velocity: unconditionally drop frames with bad l2 length · 6c606fa3
      Timo Teräs committed
      By default the driver allowed incorrect frames to be received. What is
      worse the code does not handle very short frames correctly. The FCS
      length is unconditionally subtracted, and the underflow can cause
      skb_put to be called with a large number after implicit cast to unsigned.
      And indeed, an skb_over_panic() was observed with via-velocity.
      
      This removes the module parameter as it does not work in its
      current state, and should be implemented via NETIF_F_RXALL if needed.
      Suggested-by: Francois Romieu <romieu@fr.zoreil.com>
      Signed-off-by: Timo Teräs <timo.teras@iki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c606fa3
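The underflow described here is easy to reproduce outside the driver. The following is a minimal sketch, not the via-velocity code: `payload_len_unchecked()` and `payload_len_checked()` are hypothetical helpers, and `FCS_LEN` is defined locally as the usual 4-byte Ethernet frame check sequence. It shows how subtracting the FCS length from a very short frame produces a huge unsigned value, and how rejecting such frames first avoids it.

```c
#include <assert.h>
#include <stddef.h>

#define FCS_LEN 4  /* Ethernet frame check sequence, in bytes */

/* Buggy pattern: the subtraction goes negative for short frames, and
 * the conversion to unsigned turns it into a very large length. */
static unsigned int payload_len_unchecked(int frame_len)
{
    return (unsigned int)(frame_len - FCS_LEN);
}

/* Safer pattern: drop frames too short to hold an FCS before
 * subtracting. Returns 0 on success, -1 if the frame is dropped. */
static int payload_len_checked(int frame_len, unsigned int *out)
{
    assert(out != NULL);
    if (frame_len < FCS_LEN)
        return -1;
    *out = (unsigned int)(frame_len - FCS_LEN);
    return 0;
}
```

For a 2-byte frame, the unchecked helper returns a value close to `UINT_MAX`, which is the kind of length that would overrun a buffer if passed on; the checked helper reports the frame as dropped instead.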
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · a18ab2f6
      Linus Torvalds committed
      Pull vfs fixes from Al Viro:
       "A fs-cache regression fix, and adding a warning about obnoxiou^W
        moderation of list given in MAINTAINERS"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        MAINTAINERS: linux-cachefs@redhat.com is moderated for non-subscribers
        FS-Cache: Add missing initialization of ret in cachefiles_write_page()
      a18ab2f6
    • L
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 864f83a1
      Linus Torvalds committed
      Pull crypto fix from Herbert Xu:
       "This fixes a bug in the qat driver where a user-space pointer is
        dereferenced"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: qat - don't use userspace pointer
      864f83a1
  3. 17 Nov, 2015 1 commit