1. 11 December 2018, 6 commits
  2. 10 December 2018, 1 commit
  3. 09 December 2018, 3 commits
    • net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data · dc9d38ce
      Andrew Lunn authored
      The Marvell 6390 Ethernet switch family does not perform MDIO
      turnaround correctly. Many hardware MDIO bus masters don't care about
      this, but the bitbanging implementation in Linux does by default. Add
      phy_ignore_ta_mask to the platform data so that the bitbanging code
      can be told which devices are known to get TA wrong.
      
      v2
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc9d38ce
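
      To make the mechanism concrete, here is a minimal sketch of how a board
      file might pass such a mask through platform data. The struct layout and
      the MDIO address used (2) are assumptions for illustration, not values
      taken from the patch itself.

      #include <linux/bits.h>
      #include <linux/platform_data/mdio-gpio.h>
      #include <linux/platform_device.h>

      static struct mdio_gpio_platform_data mdio_pdata = {
              /* Address 2 hosts a switch that gets MDIO turnaround wrong,
               * so tell the bitbanging code to ignore TA for it. */
              .phy_ignore_ta_mask = BIT(2),
      };

      static struct platform_device mdio_dev = {
              .name = "mdio-gpio",
              .id   = 0,
              .dev  = {
                      .platform_data = &mdio_pdata,
              },
      };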
    • net: phy: mdio-gpio: Add platform_data support for phy_mask · 04fa26ba
      Andrew Lunn authored
      It is sometimes necessary to instantiate a bit-banging MDIO bus as a
      platform device, without the aid of device tree.
      
      When device tree is being used, the bus is not scanned for devices;
      only those devices listed in the device tree are probed. Without device
      tree, by default, all addresses on the bus are scanned. This may then
      find a device which is not a PHY, e.g. a switch. And the switch may
      have registers containing values which look like a PHY. So during the
      scan, a PHY device is wrongly created.
      
      After the bus has been registered, a search is made for
      mdio_board_info structures which indicate devices on the bus and the
      driver which should be used for them. This is typically used to
      instantiate Ethernet switches from platform drivers. However, if the
      scanning of the bus has created a PHY device at the same location as
      indicated in the board info for a switch, the switch device is not
      created, since the address is already busy.
      
      This can be avoided by setting the phy_mask of the mdio bus. This mask
      prevents the corresponding addresses on the bus from being scanned.
      
      v2
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04fa26ba
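
      As a hedged illustration of the interplay described above, the sketch
      below masks a switch address out of the PHY scan and registers a
      board-info entry for it instead. The bus name "gpio-0", the address 4
      and the "mv88e6085" modalias are assumptions, not values from the patch.

      #include <linux/bits.h>
      #include <linux/mdio.h>
      #include <linux/platform_data/mdio-gpio.h>

      static struct mdio_gpio_platform_data mdio_pdata = {
              .phy_mask = BIT(4),        /* do not probe address 4 as a PHY */
      };

      static const struct mdio_board_info bus_info[] = {
              {
                      .bus_id    = "gpio-0",     /* assumed bus name */
                      .modalias  = "mv88e6085",  /* switch driver to bind */
                      .mdio_addr = 4,
              },
      };

      /* In board init code, before the bus probes its devices:
       *     mdiobus_register_board_info(bus_info, ARRAY_SIZE(bus_info));
       */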
    • Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask" · 356ff8a9
      David Rientjes authored
      This reverts commit 89c83fb5.
      
      This should have been done as part of 2f0799a0 ("mm, thp: restore
      node-local hugepage allocations").  The movement of the thp allocation
      policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
      intended to only set __GFP_THISNODE for mempolicies that are not
      MPOL_BIND whereas the revert could set this regardless of mempolicy.
      
      While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
      and alloc_pages_vma() was racy, that check has been removed since the
      revert.  What is left is the possibility of __GFP_THISNODE being used in
      policy_node() when it is unexpected, because the special handling for
      hugepages in alloc_pages_vma() was removed as part of the consolidation.
      
      Secondly, prior to 89c83fb5, alloc_pages_vma() implemented a somewhat
      different policy for hugepage allocations, which were allocated through
      alloc_hugepage_vma().  For hugepage allocations, if the allocating
      process's node is in the set of allowed nodes, allocate with
      __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
      __GFP_THISNODE instead).  This was changed for shmem_alloc_hugepage() to
      allow fallback to other nodes in 89c83fb5 as it did for new_page() in
      mm/mempolicy.c which is functionally different behavior and removes the
      requirement to only allocate hugepages locally.
      
      So this commit does a full revert of 89c83fb5 instead of the partial
      revert that was done in 2f0799a0.  The result is the same thp
      allocation policy for 4.20 that was in 4.19.
      
      Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
      Fixes: 2f0799a0 ("mm, thp: restore node-local hugepage allocations")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      356ff8a9
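
      The node-local policy described in the second paragraph can be
      summarised with the hedged sketch below; the function name and the
      simplified parameters are illustrative and do not reproduce the exact
      kernel code.

      #include <linux/gfp.h>
      #include <linux/mempolicy.h>
      #include <linux/nodemask.h>

      /* Sketch of the 4.19-era local-allocation rule for THP faults. */
      static gfp_t thp_gfpmask_sketch(unsigned short mode, nodemask_t allowed,
                                      int local_node, gfp_t gfp)
      {
              if (mode == MPOL_PREFERRED)
                      return gfp | __GFP_THISNODE;    /* preferred node only */
              if (mode != MPOL_BIND && node_isset(local_node, allowed))
                      return gfp | __GFP_THISNODE;    /* stay node-local */
              return gfp;                             /* otherwise allow fallback */
      }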
  4. 08 December 2018, 5 commits
    • neighbour: Avoid writing before skb->head in neigh_hh_output() · e6ac64d4
      Stefano Brivio authored
      While skb_push() makes the kernel panic if the skb headroom is less than
      the unaligned hardware header size, it will proceed normally in case we
      copy more than that because of alignment, and we'll silently corrupt
      adjacent slabs.
      
      In the case fixed by the previous patch,
      "ipv6: Check available headroom in ip6_xmit() even without options", we
      end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
      header and write 16 bytes, starting 2 bytes before the allocated buffer.
      
      Always check we're not writing before skb->head and, if the headroom is
      not enough, warn and drop the packet.
      
      v2:
       - instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet
         (Eric Dumazet)
       - if we avoid the panic, though, we need to explicitly check the headroom
         before the memcpy(), otherwise we'll have corrupted slabs on a running
         kernel, after we warn
       - use __skb_push() instead of skb_push(), as the headroom check is
         already implemented here explicitly (Eric Dumazet)
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6ac64d4
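
      The pattern the fix describes is roughly the following; a hedged sketch
      of the check, not the verbatim patch.

      #include <linux/netdevice.h>
      #include <linux/skbuff.h>
      #include <linux/string.h>

      static int hh_output_sketch(struct sk_buff *skb, const void *hh_data,
                                  unsigned int hh_len)
      {
              unsigned int hh_alen = HH_DATA_ALIGN(hh_len);  /* aligned copy size */

              if (WARN_ON_ONCE(skb_headroom(skb) < hh_alen)) {
                      kfree_skb(skb);          /* warn once and drop, do not panic */
                      return NET_XMIT_DROP;
              }

              /* Headroom already validated, so the unchecked push is safe. */
              memcpy(skb->data - hh_alen, hh_data, hh_alen);
              __skb_push(skb, hh_len);
              return 0;
      }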
    • neighbor: Improve garbage collection · 58956317
      David Ahern authored
      The existing garbage collection algorithm has a number of problems:
      
      1. The gc algorithm will not evict PERMANENT entries as those entries
         are managed by userspace, yet the existing algorithm walks the entire
         hash table which means it always considers PERMANENT entries when
         looking for entries to evict. In some use cases (e.g., EVPN) there
         can be tens of thousands of PERMANENT entries leading to wasted
         CPU cycles when gc kicks in. As an example, with 32k permanent
         entries, neigh_alloc has been observed taking more than 4 msec per
         invocation.
      
      2. Currently, when the number of neighbor entries hits gc_thresh2 and
         the last flush for the table was more than 5 seconds ago, gc kicks in
         and walks the entire hash table, evicting *all* entries not in PERMANENT
         or REACHABLE state and not marked as externally learned. There is no
         discriminator on when the neigh entry was created or if it just moved
         from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).
      
         It is possible for entries to be created or for established neighbor
         entries to be moved to STALE (e.g., an external node sends an ARP
         request) right before the 5 second window lapses:
      
              -----|---------x|----------|-----
                  t-5         t         t+5
      
         If that happens those entries are evicted during gc causing unnecessary
         thrashing on neighbor entries and userspace caches trying to track them.
      
         Further, this contradicts the description of gc_thresh2 which says
         "Entries older than 5 seconds will be cleared".
      
         One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
         whole point of having separate thresholds.
      
      3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
         when gc_thresh2 is exceeded is overkill and contributes to thrashing
         especially during startup.
      
      This patch addresses these problems as follows:
      
      1. Use of a separate list_head to track entries that can be garbage
         collected along with a separate counter. PERMANENT entries are not
         added to this list.
      
         The gc_thresh parameters are only compared to the new counter, not the
         total entries in the table. The forced_gc function is updated to only
         walk this new gc_list looking for entries to evict.
      
      2. Entries are added to the gc list at the tail and removed from the
         front.
      
      3. Entries are only evicted if they were last updated more than 5 seconds
         ago, adhering to the original intent of gc_thresh2.
      
      4. Forced gc is stopped once the number of gc_entries drops below
         gc_thresh2.
      
      5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
         when allocating a new neighbor for a PERMANENT entry. By extension this
         means there are no explicit limits on the number of PERMANENT entries
         that can be created, but this is no different than FIB entries or FDB
         entries.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58956317
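
      A hedged sketch of the forced-gc walk the points above describe is shown
      below; the locking is simplified, and the field and helper names
      (gc_list, gc_entries, neigh_remove_one) are assumptions based on the
      description rather than the exact final code.

      #include <linux/jiffies.h>
      #include <linux/list.h>
      #include <net/neighbour.h>

      static int forced_gc_sketch(struct neigh_table *tbl)
      {
              unsigned long tref = jiffies - 5 * HZ;   /* "older than 5 seconds" */
              struct neighbour *n, *tmp;
              int shrunk = 0;

              write_lock_bh(&tbl->lock);
              list_for_each_entry_safe(n, tmp, &tbl->gc_list, gc_list) {
                      if (!time_after(tref, n->updated))
                              continue;   /* updated within the last 5 seconds, keep */
                      if (neigh_remove_one(n, tbl))
                              shrunk++;
                      /* Stop once we are back under gc_thresh2 (point 4). */
                      if (atomic_read(&tbl->gc_entries) <= tbl->gc_thresh2)
                              break;
              }
              write_unlock_bh(&tbl->lock);
              return shrunk;
      }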
    • bridge: Add br_fdb_clear_offload() · 43920edf
      Petr Machata authored
      When a driver unoffloads all FDB entries en bloc, it's inefficient to
      send the switchdev notification one by one. Add a helper that unsets the
      offload flag on FDB entries on a given bridge port and VLAN.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      43920edf
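
      A hedged usage sketch: a driver tearing down offload for a bridge port
      clears the offload flag for all FDB entries on that port and VLAN in a
      single call instead of sending one switchdev notification per entry.
      The driver-side function name and variables are assumptions.

      #include <linux/if_bridge.h>

      static void driver_unoffload_bridge_port(struct net_device *brport_dev,
                                               u16 vid)
      {
              /* Unsets the offloaded flag on every matching FDB entry. */
              br_fdb_clear_offload(brport_dev, vid);
      }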
    • vxlan: Add vxlan_fdb_clear_offload() · e5ff4b19
      Petr Machata authored
      When a driver unoffloads all FDB entries en bloc, it's inefficient to
      send the switchdev notification one by one. Add a helper that walks the
      FDB table, unsetting the offload flag on RDST with a given VNI.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e5ff4b19
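
      The corresponding hedged usage sketch for the VXLAN side; the wrapper
      name is an assumption and the helper's exact signature may differ.

      #include <net/vxlan.h>

      static void driver_unoffload_vni(struct net_device *vxlan_dev, __be32 vni)
      {
              /* Walks the FDB and unsets the offloaded flag on every
               * remote destination carrying this VNI. */
              vxlan_fdb_clear_offload(vxlan_dev, vni);
      }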
    • vxlan: Add vxlan_fdb_replay() · 4f89f5b5
      Petr Machata authored
      When a VXLAN device becomes relevant to a driver (such as when it is
      attached to an offloaded bridge), the driver will generally need to walk
      the existing FDB entries and offload them.
      
      Add a function vxlan_fdb_replay() to call a given notifier block for
      each FDB entry with a given VNI.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4f89f5b5
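
      A hedged sketch of how a driver might use the replay helper when an
      offloaded bridge gains a VXLAN member; the exact signature (in
      particular whether an extack argument is taken) is an assumption based
      on the description above.

      #include <net/vxlan.h>

      static int driver_offload_existing_fdb(struct net_device *vxlan_dev,
                                             __be32 vni,
                                             struct notifier_block *switchdev_nb)
      {
              /* Each existing FDB entry with this VNI is delivered to the
               * notifier as if it had just been added, letting the driver
               * program its hardware tables. */
              return vxlan_fdb_replay(vxlan_dev, vni, switchdev_nb);
      }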
  5. 07 December 2018, 5 commits
  6. 06 December 2018, 8 commits
  7. 05 December 2018, 4 commits
    • USB: serial: console: fix reported terminal settings · f51ccf46
      Johan Hovold authored
      The USB-serial console implementation has never reported the actual
      terminal settings used. Although the corresponding cflags were stored in
      its struct console, they were never honoured on a later tty open(),
      where the tty termios would be left initialised to the driver defaults.
      
      Unlike the serial console implementation, the USB-serial code calls
      subdriver open() already at console setup. While calling set_termios()
      and write() before open() looks like it could work for some USB-serial
      drivers, others definitely do not expect this, so modelling this after
      serial core is going to be intrusive, if at all possible.
      
      Instead, use a (renamed) tty helper to save the termios data used at
      console setup so that the tty termios reflects the actual terminal
      settings after a subsequent tty open().
      
      Note that the calls to tty_init_termios() (tty_driver_install()) and
      tty_save_termios() are serialised using the disconnect mutex.
      
      This specifically fixes a regression that was triggered by a recent
      change adding software flow control to the pl2303 driver: a getty trying
      to disable flow control while leaving the baud rate unchanged would now
      also set the baud rate to the driver default (prior to the flow-control
      change this had been a noop).
      
      Fixes: 7041d9c3 ("USB: serial: pl2303: add support for tx xon/xoff flow control")
      Cc: stable <stable@vger.kernel.org>	# 4.18
      Cc: Florian Zumbiehl <florz@florz.de>
      Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
      Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Johan Hovold <johan@kernel.org>
      f51ccf46
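
      What such a save helper boils down to is sketched below; a hedged,
      simplified version rather than the verbatim helper from the patch, and
      it omits the allocation and driver-flag checks.

      #include <linux/tty.h>

      static void save_console_termios_sketch(struct tty_struct *tty)
      {
              struct ktermios *tp = tty->driver->termios[tty->index];

              /* Copy the termios negotiated at console setup back into the
               * driver's per-index slot so a later open() starts from it. */
              if (tp)
                      *tp = tty->termios;
      }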
    • dax: Fix unlock mismatch with updated API · 27359fd6
      Matthew Wilcox authored
      Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
      store a replacement entry in the Xarray at the given xas-index with the
      DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
      value of the entry relative to the current Xarray state to be specified.
      
      In most contexts dax_unlock_entry() is operating in the same scope as
      the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
      case the implementation needs to recall the original entry. In the case
      where the original entry is a 'pmd' entry, it is possible that the pfn
      used to perform the lookup is misaligned relative to the value retrieved
      from the Xarray.
      
      Change the API to return the unlock cookie from dax_lock_page() and pass
      it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
      assuming that the page was PMD-aligned if the entry was a PMD entry with
      signatures like:
      
       WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
       RIP: 0010:dax_insert_entry+0x2b2/0x2d0
       [..]
       Call Trace:
        dax_iomap_pte_fault.isra.41+0x791/0xde0
        ext4_dax_huge_fault+0x16f/0x1f0
        ? up_read+0x1c/0xa0
        __do_fault+0x1f/0x160
        __handle_mm_fault+0x1033/0x1490
        handle_mm_fault+0x18b/0x3d0
      
      Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
      Fixes: 9f32d221 ("dax: Convert dax_lock_mapping_entry to XArray")
      Reported-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      27359fd6
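
      A hedged sketch of the cookie-based calling convention described above:
      the caller keeps whatever the lock returned and hands it back on unlock,
      so the unlock path no longer has to reconstruct (and possibly misalign)
      a PMD entry. The caller's function name is an assumption.

      #include <linux/dax.h>
      #include <linux/mm.h>

      static bool act_on_dax_page_sketch(struct page *page)
      {
              dax_entry_t cookie = dax_lock_page(page);

              if (!cookie)
                      return false;           /* page no longer DAX-mapped */

              /* ... operate on the page while the mapping entry is locked ... */

              dax_unlock_page(page, cookie);  /* pass back the exact locked entry */
              return true;
      }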
    • tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT · a74f0fa0
      Eric Dumazet authored
      TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
      as a step to enable bigger tcp sndbuf limits.
      
      It works reasonably well, but the following happens:
      
      Once the limit is reached, the TCP stack generates
      an [E]POLLOUT event for every incoming ACK packet.
      
      This causes a high number of context switches.
      
      This patch implements the strategy David Miller added
      in sock_def_write_space():
      
       - If TCP socket has a notsent_lowat constraint of X bytes,
         allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
         only if number of notsent bytes is below X/2
      
      This considerably reduces TCP_NOTSENT_LOWAT overhead,
      while still keeping the pipe full.
      
      Tested:
       100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM
      
      A:/# cat /proc/sys/net/ipv4/tcp_wmem
      4096	262144	64000000
      A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/
      
      A:/# vmstat 5 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
       2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
       0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
       1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
       2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0
      
      A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
      
      A:/# grep TCP /proc/net/sockstat
      TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow
      
      A:/# vmstat 5 5  # check that context switches have not inflated too much.
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
       0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
       1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
       1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
       1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a74f0fa0
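
      The half-limit wakeup rule can be sketched as below; a hedged,
      simplified standalone helper, not the exact kernel predicate.

      #include <stdbool.h>
      #include <stdint.h>

      /* sendmsg() may queue up to notsent_lowat bytes, but a poll wakeup
       * (wake != 0) is only signalled once the unsent backlog has drained
       * below notsent_lowat / 2. */
      static bool tcp_stream_writeable_sketch(uint32_t write_seq,
                                              uint32_t snd_nxt,
                                              uint32_t notsent_lowat, int wake)
      {
              uint32_t notsent = write_seq - snd_nxt;  /* queued but unsent bytes */

              return (notsent << (wake ? 1 : 0)) < notsent_lowat;
      }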
    • skbuff: Rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark' · 875e8939
      Ido Schimmel authored
      Commit abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field") added
      the 'offload_mr_fwd_mark' field to indicate that a packet has already
      undergone L3 multicast routing by a capable device. The field is used to
      prevent the kernel from forwarding a packet through a netdev through
      which the device has already forwarded the packet.
      
      Currently, no unicast packet is routed by both the device and the
      kernel, but this is about to change in subsequent patches and we need to
      be able to mark such packets, so that they will not be forwarded twice.
      
      Instead of adding yet another field to 'struct sk_buff', we can just
      rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet
      either has a multicast or a unicast destination IP.
      
      While at it, add a comment about both 'offload_fwd_mark' and
      'offload_l3_fwd_mark'.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      875e8939
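
      A hedged sketch of the two marks and the comment the commit mentions;
      the surrounding structure is illustrative and does not reproduce the
      actual struct sk_buff layout.

      #include <linux/types.h>

      struct fwd_marks_sketch {
              /* Packet was already forwarded at L2 by an offloading device;
               * do not forward it again through ports of that device. */
              __u8 offload_fwd_mark:1;
              /* Packet was already routed at L3 (unicast or multicast) by an
               * offloading device; formerly named offload_mr_fwd_mark. */
              __u8 offload_l3_fwd_mark:1;
      };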
  8. 04 December 2018, 8 commits