1. 26 Mar, 2018 · 2 commits
  2. 23 Mar, 2018 · 2 commits
    • Revert "mm: page_alloc: skip over regions of invalid pfns where possible" · f59f1caf
      Authored by Daniel Vacek
      This reverts commit b92df1de ("mm: page_alloc: skip over regions of
      invalid pfns where possible").  The commit is meant to be a boot init
      speed up skipping the loop in memmap_init_zone() for invalid pfns.
      
      But given some specific memory mapping on x86_64 (or more generally
      theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
      implementation also skips valid pfns which is plain wrong and causes
      'kernel BUG at mm/page_alloc.c:1389!'
      
        crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
        kernel BUG at mm/page_alloc.c:1389!
        invalid opcode: 0000 [#1] SMP
        --
        RIP: 0010: move_freepages+0x15e/0x160
        --
        Call Trace:
          move_freepages_block+0x73/0x80
          __rmqueue+0x263/0x460
          get_page_from_freelist+0x7e1/0x9e0
          __alloc_pages_nodemask+0x176/0x420
        --
      
        crash> page_init_bug -v | grep RAM
        <struct resource 0xffff88067fffd2f8>          1000 -        9bfff       System RAM (620.00 KiB)
        <struct resource 0xffff88067fffd3a0>        100000 -     430bffff       System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
        <struct resource 0xffff88067fffd410>      4b0c8000 -     4bf9cfff       System RAM ( 14.83 MiB = 15188.00 KiB)
        <struct resource 0xffff88067fffd480>      4bfac000 -     646b1fff       System RAM (391.02 MiB = 400408.00 KiB)
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct resource 0xffff88067fffd640>     100000000 -    67fffffff       System RAM ( 22.00 GiB)
      
        crash> page_init_bug | head -6
        <struct resource 0xffff88067fffd560>      7b788000 -     7b7fffff       System RAM (480.00 KiB)
        <struct page 0xffffea0001ede200>   1fffff00000000  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        <struct page 0xffffea0001ede200>       505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
        <struct page 0xffffea0001ed8000>                0  0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA               1       4095
        <struct page 0xffffea0001edffc0>   1fffff00000400  0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32          4096    1048575
        BUG, zones differ!
      
        crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
              PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
        ffffea0001e00000  78000000                0        0  0 0
        ffffea0001ed7fc0  7b5ff000                0        0  0 0
        ffffea0001ed8000  7b600000                0        0  0 0       <<<<
        ffffea0001ede1c0  7b787000                0        0  0 0
        ffffea0001ede200  7b788000                0        0  1 1fffff00000000
      
      Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
      Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible")
      Signed-off-by: Daniel Vacek <neelx@redhat.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f59f1caf
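The pfn and struct page values in the crash output above can be reproduced with a little arithmetic. The following is only a sketch under assumed x86_64 defaults that the commit message does not state: 4 KiB pages, a pageblock order of 9, a 64-byte struct page and a vmemmap base of 0xffffea0000000000. All function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 defaults, not taken from the commit message. */
#define PAGE_SHIFT       12                      /* 4 KiB pages */
#define STRUCT_PAGE_SIZE 64ULL                   /* bytes per struct page */
#define VMEMMAP_BASE     0xffffea0000000000ULL   /* start of the struct page array */

/* Physical address -> page frame number. */
static uint64_t phys_to_pfn(uint64_t phys)
{
    return phys >> PAGE_SHIFT;
}

/* First pfn of the pageblock (2^order pages) that contains pfn. */
static uint64_t pageblock_start(uint64_t pfn, unsigned int order)
{
    assert(order < 64);
    return pfn & ~((1ULL << order) - 1);
}

/* Last pfn of the pageblock that contains pfn. */
static uint64_t pageblock_end(uint64_t pfn, unsigned int order)
{
    return pageblock_start(pfn, order) + (1ULL << order) - 1;
}

/* Address of the struct page that describes pfn. */
static uint64_t pfn_to_page_addr(uint64_t pfn)
{
    return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE;
}
```

With these assumptions, physical address 7b788000 maps to pfn 505736, whose pageblock spans pfns 505344 to 505855. The first page of that block is the zeroed struct page at ffffea0001ed8000 (physical 7b600000), which reads as zone DMA while the last page reads as DMA32, matching the "BUG, zones differ!" line.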
    • mm/vmalloc: add interfaces to free unmapped page table · b6bdb751
      Authored by Toshi Kani
      On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
      create pud/pmd mappings.  A kernel panic was observed on arm64 systems
      with Cortex-A75 in the following steps as described by Hanjun Guo.
      
       1. ioremap a 4K size; a valid page table is built.
       2. iounmap it; pte0 is set to 0.
       3. ioremap the same address with a 2M size; pgd/pmd is unchanged,
          then a new value is set for the pmd.
       4. pte0 is leaked.
       5. The CPU may take an exception because the old pmd is still in the
          TLB, which leads to a kernel panic.
      
      This panic is not reproducible on x86.  INVLPG, called from iounmap,
      purges all levels of entries associated with the purged address on x86.
      x86 still has the memory leak.
      
      The patch changes the ioremap path to free unmapped page table(s) since
      doing so in the unmap path has the following issues:
      
       - The iounmap() path is shared with vunmap(). Since vmap() only
         supports pte mappings, making vunmap() free a pte page adds
         overhead for regular vmap users, as they do not need a pte page
         freed up.
      
       - Checking if all entries in a pte page are cleared in the unmap path
         is racy, and serializing this check is expensive.
      
       - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
         Clearing a pud/pmd entry before the lazy TLB purges needs an extra
         TLB purge.
      
      Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
      clear a given pud/pmd entry and free up a page for the lower level
      entries.
      
      This patch implements their stub functions on x86 and arm64, which
      serve as a workaround.
      
      [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
      Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Reported-by: Lei Li <lious.lilei@hisilicon.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6bdb751
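The leak in steps 1 to 4 above, and the intended effect of freeing the stale pte page on the ioremap path, can be modelled in user space. This is a toy sketch, not the kernel code: the commit itself only adds stub implementations, and every name below (struct toy_pmd, toy_map_4k(), toy_pmd_free_pte_page() and so on) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define PTRS_PER_PTE 512   /* entries per pte page; assumed for the toy model */

/* Toy pmd entry: empty, pointing to a pte page, or a 2M leaf mapping. */
struct toy_pmd {
    unsigned long *pte_page;  /* non-NULL while a pte page is attached */
    unsigned long huge_phys;  /* non-zero when the pmd maps 2M directly */
};

static int live_pte_pages;    /* number of pte pages currently allocated */

/* Map one 4K page, allocating the pte page on first use. */
static int toy_map_4k(struct toy_pmd *pmd, unsigned int idx, unsigned long phys)
{
    assert(idx < PTRS_PER_PTE);
    if (!pmd->pte_page) {
        pmd->pte_page = calloc(PTRS_PER_PTE, sizeof(*pmd->pte_page));
        if (!pmd->pte_page)
            return -1;
        live_pte_pages++;
    }
    pmd->pte_page[idx] = phys;
    return 0;
}

/* Unmap one 4K page: only the pte is cleared, the pte page stays attached. */
static void toy_unmap_4k(struct toy_pmd *pmd, unsigned int idx)
{
    assert(idx < PTRS_PER_PTE);
    if (pmd->pte_page)
        pmd->pte_page[idx] = 0;
}

/* Toy analogue of pmd_free_pte_page(): detach and free the pte page.
 * Returns 1 when the pmd entry may be reused. */
static int toy_pmd_free_pte_page(struct toy_pmd *pmd)
{
    if (pmd->pte_page) {
        free(pmd->pte_page);
        pmd->pte_page = NULL;
        live_pte_pages--;
    }
    return 1;
}

/* Install a 2M mapping, releasing any stale pte page first. */
static int toy_map_2m(struct toy_pmd *pmd, unsigned long phys)
{
    if (!toy_pmd_free_pte_page(pmd))
        return -1;
    pmd->huge_phys = phys;
    return 0;
}
```

Without the call in toy_map_2m(), overwriting the pmd would drop the only reference to the pte page, which is the leak described in step 4. The toy ignores TLB behaviour entirely.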
  3. 22 Mar, 2018 · 1 commit
  4. 21 Mar, 2018 · 2 commits
  5. 20 Mar, 2018 · 3 commits
    • jump_label: Disable jump labels in __exit code · 578ae447
      Authored by Josh Poimboeuf
      With the following commit:
      
        33352244 ("jump_label: Explicitly disable jump labels in __init code")
      
      ... we explicitly disabled jump labels in __init code, so they could be
      detected and not warned about in the following commit:
      
        dc1dd184 ("jump_label: Warn on failed jump_label patching attempt")
      
      In-kernel __exit code has the same issue.  It's never used, so it's
      freed along with the rest of initmem.  But jump label entries in __exit
      code aren't explicitly disabled, so we get the following warning when
      enabling pr_debug() in __exit code:
      
        can't patch jump_label at dmi_sysfs_exit+0x0/0x2d
        WARNING: CPU: 0 PID: 22572 at kernel/jump_label.c:376 __jump_label_update+0x9d/0xb0
      
      Fix the warning by disabling all jump labels in initmem (which includes
      both __init and __exit code).
      Reported-and-tested-by: Li Wang <liwang@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: dc1dd184 ("jump_label: Warn on failed jump_label patching attempt")
      Link: http://lkml.kernel.org/r/7121e6e595374f06616c505b6e690e275c0054d1.1521483452.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      578ae447
    • RDMA/verbs: Remove restrack entry from XRCD structure · 80cf79ae
      Authored by Leon Romanovsky
      The XRCD object is not implemented in restrack, so let's remove it.
      
      Fixes: 02d8883f ("RDMA/restrack: Add general infrastructure to track RDMA resources")
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      80cf79ae
    • percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods · b3a5d111
      Authored by Tejun Heo
      percpu_ref internally uses sched-RCU to implement the percpu -> atomic
      mode switching and the documentation suggested that this could be
      depended upon.  This doesn't seem like a good idea.
      
      * percpu_ref uses sched-RCU, which has different grace periods from
        regular RCU.  Users may combine percpu_ref with regular RCU usage and
        incorrectly believe that regular RCU grace periods are performed by
        percpu_ref.  This can lead to, for example, use-after-free due to
        premature freeing.
      
      * percpu_ref has a grace period when switching from percpu to atomic
        mode.  It doesn't have one between the last put and release.  This
        distinction is subtle and can lead to surprising bugs.
      
      * percpu_ref allows starting in and switching to atomic mode manually
        for debugging and other purposes.  This means that there may not be
        any grace periods from kill to release.
      
      This patch makes it clear that the grace periods are percpu_ref's
      internal implementation detail and can't be depended upon by the
      users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b3a5d111
  6. 19 Mar, 2018 · 1 commit
  7. 16 Mar, 2018 · 3 commits
    • vlan: Fix out of order vlan headers with reorder header off · cbe7128c
      Authored by Toshiaki Makita
      With reorder header off, received packets are untagged in skb_vlan_untag()
      called from within __netif_receive_skb_core(), and later the tag will be
      inserted back in vlan_do_receive().
      
      This caused out of order vlan headers when we create a vlan device on top
      of another vlan device, because vlan_do_receive() inserts a tag as the
      outermost vlan tag. E.g. the outer tag is first removed in skb_vlan_untag()
      and inserted back in vlan_do_receive(), then the inner tag is next removed
      and inserted back as the outermost tag.
      
      This patch fixes the behaviour by inserting the inner tag at the right
      position.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbe7128c
    • net: Fix vlan untag for bridge and vlan_dev with reorder_hdr off · 4bbb3e0e
      Authored by Toshiaki Makita
      When we have a bridge with vlan_filtering on and a vlan device on top of
      it, packets would be corrupted in skb_vlan_untag() called from
      br_dev_xmit().
      
      The problem sits in skb_reorder_vlan_header() used in skb_vlan_untag(),
      which makes use of skb->mac_len. In this function mac_len is meant for
      handling rx path with vlan devices with reorder_header disabled, but in
      tx path mac_len is typically 0 and cannot be used, which is the problem
      in this case.
      
      In fact, the current code does not even handle the rx path properly
      (skb_vlan_untag() called from __netif_receive_skb_core()) with
      reorder_header off.
      
      In rx path single tag case, it works as follows:
      
      - Before skb_reorder_vlan_header()
      
       mac_header                                data
         v                                        v
         +-------------------+-------------+------+----
         |        ETH        |    VLAN     | ETH  |
         |       ADDRS       | TPID | TCI  | TYPE |
         +-------------------+-------------+------+----
         <-------- mac_len --------->
                             <------------->
                              to be removed
      
      - After skb_reorder_vlan_header()
      
                  mac_header                     data
                       v                          v
                       +-------------------+------+----
                       |        ETH        | ETH  |
                       |       ADDRS       | TYPE |
                       +-------------------+------+----
                       <-------- mac_len --------->
      
      This is ok, but in rx double tag case, it corrupts packets:
      
      - Before skb_reorder_vlan_header()
      
       mac_header                                              data
         v                                                      v
         +-------------------+-------------+-------------+------+----
         |        ETH        |    VLAN     |    VLAN     | ETH  |
         |       ADDRS       | TPID | TCI  | TPID | TCI  | TYPE |
         +-------------------+-------------+-------------+------+----
         <--------------- mac_len ---------------->
                                           <------------->
                                          should be removed
                             <--------------------------->
                               actually will be removed
      
      - After skb_reorder_vlan_header()
      
                  mac_header                                   data
                       v                                        v
                                     +-------------------+------+----
                                     |        ETH        | ETH  |
                                     |       ADDRS       | TYPE |
                                     +-------------------+------+----
                       <--------------- mac_len ---------------->
      
      So both vlan tags are removed when only the inner one should be,
      and mac_header (and mac_len) is left broken.
      
      skb_vlan_untag() is meant for removing the vlan header at (skb->data - 2),
      so use skb->data and skb->mac_header to calculate the right offset.
      Reported-by: Brandon Carpenter <brandon.carpenter@cypherpath.com>
      Fixes: a6e18ff1 ("vlan: Fix untag operations of stacked vlans with REORDER_HEADER off")
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bbb3e0e
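The offset calculation described in the last paragraph of the message above can be sketched in user space on a plain byte buffer. This is an illustration only, not the kernel function: toy_reorder_vlan_header() and its offset parameters are hypothetical stand-ins for skb->mac_header and skb->data, and the sketch assumes data_off points just past the EtherType that follows the tag being stripped, as in the diagrams.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VLAN_HLEN 4  /* bytes in one VLAN tag: TPID + TCI */
#define ETH_TLEN  2  /* bytes in the EtherType field */

/*
 * frame:    start of the packet buffer
 * mac_off:  offset of the MAC header (stand-in for skb->mac_header)
 * data_off: offset just past the EtherType that follows the tag to strip
 *           (stand-in for skb->data)
 * Returns the new MAC header offset.
 */
static size_t toy_reorder_vlan_header(uint8_t *frame, size_t mac_off,
                                      size_t data_off)
{
    size_t mac_len;

    assert(data_off >= mac_off);
    /* Derive the length from the two offsets instead of a cached mac_len. */
    mac_len = data_off - mac_off;

    /* Slide everything in front of the stripped tag forward by one tag,
     * overwriting the tag that sits right before the EtherType. */
    if (mac_len > VLAN_HLEN + ETH_TLEN)
        memmove(frame + mac_off + VLAN_HLEN, frame + mac_off,
                mac_len - VLAN_HLEN - ETH_TLEN);

    return mac_off + VLAN_HLEN;
}
```

For a double-tagged frame of 22 header bytes, the move covers 16 bytes (addresses plus the outer tag), so the outer tag survives and lands directly in front of the EtherType, which is the "should be removed" case in the diagram.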
    • fs: Teach path_connected to handle nfs filesystems with multiple roots. · 95dd7758
      Authored by Eric W. Biederman
      On nfsv2 and nfsv3 the nfs server can export subsets of the same
      filesystem and report the same filesystem identifier, so that the nfs
      client can know they are the same filesystem.  The subsets can be from
      disjoint directory trees.  The nfsv2 and nfsv3 filesystems provide no
      way to find the common root of all directory trees exported from the
      server with the same filesystem identifier.
      
      The practical result is that, for nfs, s_root in struct super_block is
      not necessarily the root of the filesystem.  The nfs mount code sets
      s_root to the root of the first subset of the nfs filesystem that the
      kernel mounts.
      
      This affects the dcache invalidation code in generic_shutdown_super,
      currently called shrink_dcache_for_umount, and for years that code
      has gone through an additional list of dentries that might be dentry
      trees that need to be freed to accommodate nfs.
      
      When I wrote path_connected I did not realize nfs was so special, and
      its heuristic for avoiding calling is_subdir can fail.
      
      The practical case where this fails is when there is a move of a
      directory from the subtree exposed by one nfs mount to the subtree
      exposed by another nfs mount.  This move can happen either locally or
      remotely.  The remote case requires that the moved directory be cached
      before the move, and that after the move someone walks the path
      to where the moved directory now exists and in so doing causes the
      already cached directory to be moved in the dcache through the magic
      of d_splice_alias.
      
      If someone whose working directory is in the moved directory or a
      subdirectory now starts calling .. from the initial mount of nfs
      (where s_root == mnt_root), then path_connected as a heuristic will
      not bother with the is_subdir check.  As s_root really is not the root
      of the nfs filesystem this heuristic is wrong, and the path may
      actually not be connected and path_connected can fail.
      
      The is_subdir function might be cheap enough that we can call it
      unconditionally.  Verifying that will take some benchmarking and
      the result may not be the same on all kernels this fix needs
      to be backported to.  So I am avoiding that for now.
      
      Filesystems with snapshots such as nilfs and btrfs do something
      similar.  But as the directory tree of the snapshots are disjoint
      from one another and from the main directory tree rename won't move
      things between them and this problem will not occur.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      95dd7758
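The failing heuristic can be illustrated with a toy dentry tree. The sketch below is not the kernel's path_connected(); the structures and the `multiroot` flag are hypothetical, introduced only to show why the s_root == mnt_root shortcut has to be skipped when s_root may not be the real filesystem root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dentry {
    struct toy_dentry *parent;   /* the top of a tree points to itself */
};

struct toy_sb {
    struct toy_dentry *s_root;
    bool multiroot;              /* hypothetical: s_root may not be the true root */
};

struct toy_path {
    struct toy_dentry *dentry;
    struct toy_dentry *mnt_root;
    struct toy_sb *sb;
};

/* Walk parent pointers: true if d is root or lies below root. */
static bool toy_is_subdir(const struct toy_dentry *d,
                          const struct toy_dentry *root)
{
    assert(d != NULL && root != NULL);
    for (;;) {
        if (d == root)
            return true;
        if (d->parent == d)      /* reached the top of a different tree */
            return false;
        d = d->parent;
    }
}

static bool toy_path_connected(const struct toy_path *p)
{
    /* The shortcut is only valid when s_root really is the filesystem root. */
    if (!p->sb->multiroot && p->mnt_root == p->sb->s_root)
        return true;
    return toy_is_subdir(p->dentry, p->mnt_root);
}
```

With the flag cleared, a dentry that was moved into another exported subtree is still reported as connected because the shortcut fires; with the flag set, the walk runs and correctly reports it as disconnected.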
  8. 15 Mar, 2018 · 5 commits
    • mmc: core: Fix tracepoint print of blk_addr and blksz · c658dc58
      Authored by Adrian Hunter
      Swap the positions of blk_addr and blksz in the tracepoint print arguments
      so that they match the print format.
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Fixes: d2f82254 ("mmc: core: Add members to mmc_request and mmc_data for CQE's")
      Cc: <stable@vger.kernel.org> # 4.14+
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      c658dc58
    • KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintid · 16ca6a60
      Authored by Marc Zyngier
      The vgic code is trying to be clever when injecting GICv2 SGIs,
      and will happily populate LRs with the same interrupt number if
      they come from multiple vcpus (after all, they are distinct
      interrupt sources).
      
      Unfortunately, this is against the letter of the architecture,
      and the GICv2 architecture spec says "Each valid interrupt stored
      in the List registers must have a unique VirtualID for that
      virtual CPU interface.". GICv3 has similar (although slightly
      ambiguous) restrictions.
      
      This results in guests locking up when using GICv2-on-GICv3, for
      example. The obvious fix is to stop trying so hard, and inject
      a single vcpu per SGI per guest entry. After all, pending SGIs
      with multiple source vcpus are pretty rare, and are mostly seen
      in scenarios where the physical CPUs are severely overcommitted.
      
      But as we now only inject a single instance of a multi-source SGI per
      vcpu entry, we may delay those interrupts for longer than strictly
      necessary, and run the risk of injecting lower priority interrupts
      in the meantime.
      
      In order to address this, we adopt a three stage strategy:
      - If we encounter a multi-source SGI in the AP list while computing
        its depth, we force the list to be sorted
      - When populating the LRs, we prevent the injection of any interrupt
        of lower priority than that of the first multi-source SGI we've
        injected.
      - Finally, the injection of a multi-source SGI triggers the request
        of a maintenance interrupt when there will be no pending interrupt
        in the LRs (HCR_NPIE).
      
      At the point where the last pending interrupt in the LRs switches
      from Pending to Active, the maintenance interrupt will be delivered,
      allowing us to add the remaining SGIs using the same process.
      
      Cc: stable@vger.kernel.org
      Fixes: 0919e84c ("KVM: arm/arm64: vgic-new: Add IRQ sync/flush framework")
      Acked-by: Christoffer Dall <cdall@kernel.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      16ca6a60
    • KVM: arm/arm64: Reset mapped IRQs on VM reset · 413aa807
      Authored by Christoffer Dall
      We currently don't allow resetting mapped IRQs from userspace, because
      their state is controlled by the hardware.  But we do need to reset the
      state when the VM is reset, so we provide a function for the 'owner' of
      the mapped interrupt to reset the interrupt state.
      
      Currently only the timer uses mapped interrupts, so we call this
      function from the timer reset logic.
      
      Cc: stable@vger.kernel.org
      Fixes: 4c60e360 ("KVM: arm/arm64: Provide a get_input_level for the arch timer")
      Signed-off-by: Christoffer Dall <cdall@kernel.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      413aa807
    • ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu · d52e5a7e
      Authored by Sabrina Dubroca
      Prior to the rework of PMTU information storage in commit
      2c8cec5c ("ipv4: Cache learned PMTU information in inetpeer."),
      when a PMTU event advertising a PMTU smaller than
      net.ipv4.route.min_pmtu was received, we would disable setting the DF
      flag on packets by locking the MTU metric, and set the PMTU to
      net.ipv4.route.min_pmtu.
      
      Since then, we don't disable DF, and set PMTU to
      net.ipv4.route.min_pmtu, so the intermediate router that has this link
      with a small MTU will have to drop the packets.
      
      This patch reestablishes pre-2.6.39 behavior by splitting
      rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu.
      rt_mtu_locked indicates that we shouldn't set the DF bit on that path,
      and is checked in ip_dont_fragment().
      
      One possible workaround is to set net.ipv4.route.min_pmtu to a value low
      enough to accommodate the lowest MTU encountered.
      
      Fixes: 2c8cec5c ("ipv4: Cache learned PMTU information in inetpeer.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d52e5a7e
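The bitfield split described in the message above can be sketched as follows. This is a simplified user-space model, not the kernel code: struct toy_rtable, toy_update_pmtu() and toy_dont_fragment() are hypothetical, and the real ip_dont_fragment() takes other inputs into account as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy route entry: a lock bit and a 31-bit PMTU declared as adjacent bitfields. */
struct toy_rtable {
    unsigned int rt_mtu_locked : 1;
    unsigned int rt_pmtu       : 31;
};

/* Record a learned PMTU. A value below min_pmtu is clamped to min_pmtu
 * and the route is marked as locked. */
static void toy_update_pmtu(struct toy_rtable *rt, unsigned int mtu,
                            unsigned int min_pmtu)
{
    bool lock = false;

    assert(mtu < (1u << 31) && min_pmtu < (1u << 31));
    if (mtu < min_pmtu) {
        lock = true;
        mtu = min_pmtu;
    }
    rt->rt_pmtu = mtu;
    rt->rt_mtu_locked = lock;
}

/* Should DF be set on this route? Not when the MTU is locked. */
static bool toy_dont_fragment(const struct toy_rtable *rt)
{
    return !rt->rt_mtu_locked;
}
```

For example, an advertised PMTU of 300 with a minimum of 552 stores 552 and sets the lock bit, so DF is no longer requested and a router on the small-MTU link can fragment instead of dropping the packet.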
    • net: use skb_to_full_sk() in skb_update_prio() · 4dcb31d4
      Authored by Eric Dumazet
      Andrei Vagin reported a KASAN: slab-out-of-bounds error in
      skb_update_prio()
      
      Since SYNACK might be attached to a request socket, we need to
      get back to the listener socket.
      Since this listener is manipulated without locks, add const
      qualifiers to sock_cgroup_prioidx() so that the const can also
      be used in skb_update_prio()
      
      Also add the const qualifier to sock_cgroup_classid() for consistency.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4dcb31d4
  9. 14 Mar, 2018 · 2 commits
  10. 12 Mar, 2018 · 3 commits
    • sock_diag: request _diag module only when the family or proto has been registered · bf2ae2e4
      Authored by Xin Long
      Now when using 'ss' in iproute, kernel would try to load all _diag
      modules, which also causes corresponding family and proto modules
      to be loaded as well due to module dependencies.
      
      For instance, after running 'ss', sctp, dccp and af_packet (if it is
      built as a module) would be loaded.
      
      For example:
      
        $ lsmod|grep sctp
        $ ss
        $ lsmod|grep sctp
        sctp_diag              16384  0
        sctp                  323584  5 sctp_diag
        inet_diag              24576  4 raw_diag,tcp_diag,sctp_diag,udp_diag
        libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
      
      As these family and proto modules are loaded unintentionally, it
      could cause some problems, like:
      
      - Some debug tools use 'ss' to collect the socket info, which loads all
        those diag and family and protocol modules. It's noisy for identifying
        issues.
      
      - Users usually expect to drop sctp init packet silently when they
        have no sense of sctp protocol instead of sending abort back.
      
      - It wastes resources (especially with multiple netns), and SCTP module
        can't be unloaded once it's loaded.
      
      ...
      
      In short, it's really inappropriate to have these family and proto
      modules loaded unexpectedly when just doing debugging with inet_diag.
      
      This patch introduces sock_load_diag_module(), which loads
      the _diag module only when its corresponding family or proto has
      already been registered.
      
      Note that we can't just load _diag module without the family or
      proto loaded, as some symbols used in _diag module are from the
      family or proto module.
      
      v1->v2:
        - move inet proto check to inet_diag to avoid a compile error.
      v2->v3:
        - define sock_load_diag_module in sock.c and export one symbol
          only.
        - improve the changelog.
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Phil Sutter <phil@nwl.cc>
      Acked-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bf2ae2e4
    • net: phy: Tell caller result of phy_change() · a2c054a8
      Authored by Brad Mouring
      In 664fcf12 (net: phy: Threaded interrupts allow some simplification)
      the phy_interrupt system was changed to use a traditional threaded
      interrupt scheme instead of a workqueue approach.
      
      With this change, the phy status check moved into phy_change, which
      did not report back to the caller whether or not the interrupt was
      handled. This means that, in the case of a shared phy interrupt,
      only the first phydev's interrupt registers are checked (since
      phy_interrupt() would always return IRQ_HANDLED). This leads to
      interrupt storms when it is a secondary device that's actually the
      interrupt source.
      Signed-off-by: Brad Mouring <brad.mouring@ni.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2c054a8
    • netfilter: x_tables: add and use xt_check_proc_name · b1d0a5d0
      Authored by Florian Westphal
      recent and hashlimit both create /proc files, but only check that
      name is 0 terminated.
      
      This can trigger WARN() from procfs when name is "" or "/".
      Add helper for this and then use it for both.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: <syzbot+0502b00edac2a0680b61@syzkaller.appspotmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b1d0a5d0
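A name check of the kind the message above describes might look like the following user-space sketch. It is not the kernel helper: toy_check_proc_name() is a hypothetical name, and rejecting "." and ".." is an extra assumption beyond the "" and "/" cases the message mentions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Validate a fixed-size name buffer before using it as a file name.
 * Returns 0 when the name is usable, a negative errno value otherwise. */
static int toy_check_proc_name(const char *name, size_t size)
{
    assert(name != NULL && size > 0);

    if (name[0] == '\0')
        return -EINVAL;              /* empty name */
    if (memchr(name, '\0', size) == NULL)
        return -ENAMETOOLONG;        /* no terminator inside the buffer */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0 ||
        strchr(name, '/') != NULL)
        return -EINVAL;              /* not a plain file name */
    return 0;
}
```

The terminator check runs before any string function that relies on it, so strcmp() and strchr() never read past the buffer.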
  11. 10 Mar, 2018 · 1 commit
  12. 08 Mar, 2018 · 2 commits
    • net: usbnet: fix potential deadlock on 32bit hosts · 2695578b
      Authored by Eric Dumazet
      Marek reported a LOCKDEP issue occurring on 32bit host,
      that we tracked down to the fact that usbnet could either
      run from soft or hard irqs.
      
      This patch adds u64_stats_update_begin_irqsave() and
      u64_stats_update_end_irqrestore() helpers to solve this case.
      
      [   17.768040] ================================
      [   17.772239] WARNING: inconsistent lock state
      [   17.776511] 4.16.0-rc3-next-20180227-00007-g876c53a7493c #453 Not tainted
      [   17.783329] --------------------------------
      [   17.787580] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
      [   17.793607] swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      [   17.798751]  (&syncp->seq#5){?.-.}, at: [<9b22e5f0>] asix_rx_fixup_internal+0x188/0x288
      [   17.806790] {IN-HARDIRQ-W} state was registered at:
      [   17.811677]   tx_complete+0x100/0x208
      [   17.815319]   __usb_hcd_giveback_urb+0x60/0xf0
      [   17.819770]   xhci_giveback_urb_in_irq+0xa8/0x240
      [   17.824469]   xhci_td_cleanup+0xf4/0x16c
      [   17.828367]   xhci_irq+0xe74/0x2240
      [   17.831827]   usb_hcd_irq+0x24/0x38
      [   17.835343]   __handle_irq_event_percpu+0x98/0x510
      [   17.840111]   handle_irq_event_percpu+0x1c/0x58
      [   17.844623]   handle_irq_event+0x38/0x5c
      [   17.848519]   handle_fasteoi_irq+0xa4/0x138
      [   17.852681]   generic_handle_irq+0x18/0x28
      [   17.856760]   __handle_domain_irq+0x6c/0xe4
      [   17.860941]   gic_handle_irq+0x54/0xa0
      [   17.864666]   __irq_svc+0x70/0xb0
      [   17.867964]   arch_cpu_idle+0x20/0x3c
      [   17.871578]   arch_cpu_idle+0x20/0x3c
      [   17.875190]   do_idle+0x144/0x218
      [   17.878468]   cpu_startup_entry+0x18/0x1c
      [   17.882454]   start_kernel+0x394/0x400
      [   17.886177] irq event stamp: 161912
      [   17.889616] hardirqs last  enabled at (161912): [<7bedfacf>] __netdev_alloc_skb+0xcc/0x140
      [   17.897893] hardirqs last disabled at (161911): [<d58261d0>] __netdev_alloc_skb+0x94/0x140
      [   17.904903] exynos5-hsi2c 12ca0000.i2c: tx timeout
      [   17.906116] softirqs last  enabled at (161904): [<387102ff>] irq_enter+0x78/0x80
      [   17.906123] softirqs last disabled at (161905): [<cf4c628e>] irq_exit+0x134/0x158
      [   17.925722].
      [   17.925722] other info that might help us debug this:
      [   17.933435]  Possible unsafe locking scenario:
      [   17.933435].
      [   17.940331]        CPU0
      [   17.942488]        ----
      [   17.944894]   lock(&syncp->seq#5);
      [   17.948274]   <Interrupt>
      [   17.950847]     lock(&syncp->seq#5);
      [   17.954386].
      [   17.954386]  *** DEADLOCK ***
      [   17.954386].
      [   17.962422] no locks held by swapper/0/0.
      
      Fixes: c8b5d129 ("net: usbnet: support 64bit stats")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2695578b
    • sch_netem: fix skb leak in netem_enqueue() · 35d889d1
      Authored by Alexey Kodanev
      When we exceed current packets limit and we have more than one
      segment in the list returned by skb_gso_segment(), netem drops
      only the first one, skipping the rest, hence kmemleak reports:
      
      unreferenced object 0xffff880b5d23b600 (size 1024):
        comm "softirq", pid 0, jiffies 4384527763 (age 2770.629s)
        hex dump (first 32 bytes):
          00 80 23 5d 0b 88 ff ff 00 00 00 00 00 00 00 00  ..#]............
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000d8a19b9d>] __alloc_skb+0xc9/0x520
          [<000000001709b32f>] skb_segment+0x8c8/0x3710
          [<00000000c7b9bb88>] tcp_gso_segment+0x331/0x1830
          [<00000000c921cba1>] inet_gso_segment+0x476/0x1370
          [<000000008b762dd4>] skb_mac_gso_segment+0x1f9/0x510
          [<000000002182660a>] __skb_gso_segment+0x1dd/0x620
          [<00000000412651b9>] netem_enqueue+0x1536/0x2590 [sch_netem]
          [<0000000005d3b2a9>] __dev_queue_xmit+0x1167/0x2120
          [<00000000fc5f7327>] ip_finish_output2+0x998/0xf00
          [<00000000d309e9d3>] ip_output+0x1aa/0x2c0
          [<000000007ecbd3a4>] tcp_transmit_skb+0x18db/0x3670
          [<0000000042d2a45f>] tcp_write_xmit+0x4d4/0x58c0
          [<0000000056a44199>] tcp_tasklet_func+0x3d9/0x540
          [<0000000013d06d02>] tasklet_action+0x1ca/0x250
          [<00000000fcde0b8b>] __do_softirq+0x1b4/0x5a3
          [<00000000e7ed027c>] irq_exit+0x1e2/0x210
      
      Fix it by adding the rest of the segments, if any, to skb 'to_free'
      list. Add new __qdisc_drop_all() and qdisc_drop_all() functions
      because they can be useful in the future if we need to drop segmented
      GSO packets in other places.
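
      The difference between dropping one segment and dropping the whole
      segment list can be sketched in plain userspace C. This is only an
      illustration: struct seg, drop_first(), drop_all() and list_len() are
      hypothetical stand-ins, not the kernel's struct sk_buff or the
      qdisc_drop_all() helpers themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a packet segment: only the list link matters. */
struct seg {
    struct seg *next;
};

/*
 * Leaky pattern: only the first segment is moved to the to_free list.
 * The remaining segments are returned here so the problem is visible;
 * a caller that ignores them leaks them.
 */
static struct seg *drop_first(struct seg *segs, struct seg **to_free)
{
    struct seg *rest = segs->next;

    segs->next = *to_free;
    *to_free = segs;
    return rest;
}

/* Fixed pattern: splice the whole segment list in front of to_free. */
static void drop_all(struct seg *segs, struct seg **to_free)
{
    struct seg *last = segs;

    while (last->next)
        last = last->next;
    last->next = *to_free;
    *to_free = segs;
}

/* Count the entries on a list. */
static size_t list_len(const struct seg *s)
{
    size_t n = 0;

    for (; s; s = s->next)
        n++;
    return n;
}
```

      With three segments and one entry already on the to_free list,
      drop_all() leaves four entries to free, whereas drop_first() would
      queue only one of the three and leave the other two unreferenced.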
      
      Fixes: 6071bd1a ("netem: Segment GSO packets on enqueue")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35d889d1
  13. 07 Mar, 2018 2 commits
    • P
      rhashtable: Fix rhlist duplicates insertion · d3dcf8eb
      Paul Blakey authored
      When inserting duplicate objects (those with the same key), the
      current rhlist implementation corrupts the chain pointers: it
      updates the bucket pointer to the newly inserted node instead of
      the previous element's next pointer. This causes missing elements
      on removal and traversal.
      
      Fix that by properly updating pprev pointer to point to
      the correct rhash_head next pointer.
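
      The pprev idea can be illustrated with a simplified, non-RCU userspace
      sketch. struct node, chain_insert() and chain_count() are hypothetical
      names chosen for this example; the real rhashtable code differs in
      layout and locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified chain entry, not the kernel's rhash_head. */
struct node {
    int key;
    struct node *next;  /* next distinct key in the bucket chain */
    struct node *dup;   /* older entries with the same key */
};

/* Insert n into the chain rooted at *bucket. */
static void chain_insert(struct node **bucket, struct node *n)
{
    struct node **pprev = bucket;
    struct node *cur;

    for (cur = *bucket; cur; cur = cur->next) {
        if (cur->key == n->key) {
            /* n replaces cur in the chain and keeps cur as a duplicate. */
            n->dup = cur;
            n->next = cur->next;
            cur->next = NULL;
            /*
             * Write through pprev, which tracks the link that led to cur.
             * Writing to *bucket here instead would cut off every entry
             * that sits in front of cur.
             */
            *pprev = n;
            return;
        }
        pprev = &cur->next;  /* advance to this entry's next pointer */
    }

    /* No duplicate found: insert at the head of the chain. */
    n->dup = NULL;
    n->next = *bucket;
    *bucket = n;
}

/* Count every entry reachable from the bucket, duplicates included. */
static size_t chain_count(const struct node *bucket)
{
    size_t count = 0;

    for (const struct node *h = bucket; h; h = h->next)
        for (const struct node *d = h; d; d = d->dup)
            count++;
    return count;
}
```

      Inserting keys 1, 2 and then 1 again keeps all three entries reachable
      only because the last insert writes to the next pointer of the entry
      with key 2 rather than to the bucket.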
      
      Issue: 1241076
      Change-Id: I86b2c140bcb4aeb10b70a72a267ff590bb2b17e7
      Fixes: ca26893f ('rhashtable: Add rhlist interface')
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3dcf8eb
    • D
      usb: quirks: add control message delay for 1b1c:1b20 · cb88a058
      Danilo Krummrich authored
      Corsair Strafe RGB keyboard does not respond to usb control messages
      sometimes and hence generates timeouts.
      
      Commit de3af5bf ("usb: quirks: add delay init quirk for Corsair
      Strafe RGB keyboard") tried to fix those timeouts by adding
      USB_QUIRK_DELAY_INIT.
      
      Unfortunately, even with this quirk timeouts of usb_control_msg()
      can still be seen, but with a lower frequency (approx. 1 out of 15):
      
      [   29.103520] usb 1-8: string descriptor 0 read error: -110
      [   34.363097] usb 1-8: can't set config #1, error -110
      
      Adding further delays to different locations where usb control
      messages are issued just moves the timeouts to other locations,
      e.g.:
      
      [   35.400533] usbhid 1-8:1.0: can't add hid device: -110
      [   35.401014] usbhid: probe of 1-8:1.0 failed with error -110
      
      The only way to reliably avoid those issues is having a pause after
      each usb control message. In approx. 200 boot cycles no more timeouts
      were seen.
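
      As a rough userspace sketch of "pause after each control message",
      assuming a per-device quirk flag: QUIRK_DELAY_CTRL_MSG, struct
      fake_dev, fake_sleep_ms() and fake_control_msg() are hypothetical
      names, and the 200 ms value is an illustrative placeholder rather
      than a figure taken from this message.

```c
#include <assert.h>

/* Hypothetical quirk bit for this sketch only. */
#define QUIRK_DELAY_CTRL_MSG (1u << 0)

/* Hypothetical device model that records what happened to it. */
struct fake_dev {
    unsigned int quirks;
    unsigned int msgs_sent;
    unsigned int pauses;
};

/* Stand-in for a real sleep: only counts how often a pause was requested. */
static void fake_sleep_ms(struct fake_dev *dev, unsigned int ms)
{
    (void)ms;
    dev->pauses++;
}

/* Send one control message, then pause if the device carries the quirk. */
static int fake_control_msg(struct fake_dev *dev)
{
    dev->msgs_sent++;  /* the actual transfer would happen here */

    if (dev->quirks & QUIRK_DELAY_CTRL_MSG)
        fake_sleep_ms(dev, 200);  /* illustrative delay value */

    return 0;
}
```

      Placing the pause inside the common send path, rather than at
      individual call sites, is what keeps the timeouts from simply moving
      to another location.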
      
      Additionally, keep USB_QUIRK_DELAY_INIT as it turned out to be necessary
      to have the delay in hub_port_connect() after hub_port_init().
      
      The overall boot time seems not to be influenced by these additional
      delays, even on fast machines and lightweight distributions.
      
      Fixes: de3af5bf ("usb: quirks: add delay init quirk for Corsair Strafe RGB keyboard")
      Cc: stable@vger.kernel.org
      Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cb88a058
  14. 06 Mar, 2018 2 commits
  15. 05 Mar, 2018 3 commits
  16. 04 Mar, 2018 1 commit
  17. 03 Mar, 2018 1 commit
    • M
      signals: Move put_compat_sigset to compat.h to silence hardened usercopy · fde9fc76
      Matt Redfearn authored
      Since commit afcc90f8 ("usercopy: WARN() on slab cache usercopy
      region violations"), MIPS systems booting with a compat root filesystem
      emit a warning when copying compat siginfo to userspace:
      
      WARNING: CPU: 0 PID: 953 at mm/usercopy.c:81 usercopy_warn+0x98/0xe8
      Bad or missing usercopy whitelist? Kernel memory exposure attempt
      detected from SLAB object 'task_struct' (offset 1432, size 16)!
      Modules linked in:
      CPU: 0 PID: 953 Comm: S01logging Not tainted 4.16.0-rc2 #10
      Stack : ffffffff808c0000 0000000000000000 0000000000000001 65ac85163f3bdc4a
      	65ac85163f3bdc4a 0000000000000000 90000000ff667ab8 ffffffff808c0000
      	00000000000003f8 ffffffff808d0000 00000000000000d1 0000000000000000
      	000000000000003c 0000000000000000 ffffffff808c8ca8 ffffffff808d0000
      	ffffffff808d0000 ffffffff80810000 fffffc0000000000 ffffffff80785c30
      	0000000000000009 0000000000000051 90000000ff667eb0 90000000ff667db0
      	000000007fe0d938 0000000000000018 ffffffff80449958 0000000020052798
      	ffffffff808c0000 90000000ff664000 90000000ff667ab0 00000000100c0000
      	ffffffff80698810 0000000000000000 0000000000000000 0000000000000000
      	0000000000000000 0000000000000000 ffffffff8010d02c 65ac85163f3bdc4a
      	...
      Call Trace:
      [<ffffffff8010d02c>] show_stack+0x9c/0x130
      [<ffffffff80698810>] dump_stack+0x90/0xd0
      [<ffffffff80137b78>] __warn+0x100/0x118
      [<ffffffff80137bdc>] warn_slowpath_fmt+0x4c/0x70
      [<ffffffff8021e4a8>] usercopy_warn+0x98/0xe8
      [<ffffffff8021e68c>] __check_object_size+0xfc/0x250
      [<ffffffff801bbfb8>] put_compat_sigset+0x30/0x88
      [<ffffffff8011af24>] setup_rt_frame_n32+0xc4/0x160
      [<ffffffff8010b8b4>] do_signal+0x19c/0x230
      [<ffffffff8010c408>] do_notify_resume+0x60/0x78
      [<ffffffff80106f50>] work_notifysig+0x10/0x18
      ---[ end trace 88fffbf69147f48a ]---
      
      Commit 5905429a ("fork: Provide usercopy whitelisting for
      task_struct") noted that:
      
      "While the blocked and saved_sigmask fields of task_struct are copied to
      userspace (via sigmask_to_save() and setup_rt_frame()), it is always
      copied with a static length (i.e. sizeof(sigset_t))."
      
      However, this is not true in the case of compat signals, whose sigset
      is copied by put_compat_sigset and receives size as an argument.
      
      At most call sites, put_compat_sigset is copying a sigset from the
      current task_struct. This triggers a warning when
      CONFIG_HARDENED_USERCOPY is active. However, by marking this function as
      static inline, the warning can be avoided because in all of these cases
      the size is constant at compile time, which is allowed. The only site
      where this is not the case is handling the rt_sigpending syscall, but
      there the copy is being made from a stack local variable so does not
      trigger the warning.
      
      Move put_compat_sigset to compat.h, and mark it static inline. This
      fixes the WARN on MIPS.
      
      Fixes: afcc90f8 ("usercopy: WARN() on slab cache usercopy region violations")
      Signed-off-by: Matt Redfearn <matt.redfearn@mips.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: "Dmitry V . Levin" <ldv@altlinux.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/18639/
      Signed-off-by: James Hogan <jhogan@kernel.org>
      fde9fc76
  18. 02 Mar, 2018 3 commits
  19. 01 Mar, 2018 1 commit