1. 06 5月, 2016 4 次提交
    • A
      byteswap: try to avoid __builtin_constant_p gcc bug · 7322dd75
      Arnd Bergmann 提交于
      This is another attempt to avoid a regression in wwn_to_u64() after that
      started using get_unaligned_be64(), which in turn ran into a bug on
      gcc-4.9 through 6.1.
      
      The regression got introduced due to the combination of two separate
      workarounds (commits e3bde956: "include/linux/unaligned: force
      inlining of byteswap operations" and ef3fb242: "scsi: fc: use
      get/put_unaligned64 for wwn access") that each try to sidestep distinct
      problems with gcc behavior (code growth and increased stack usage).
      
      Unfortunately after both have been applied, a more serious gcc bug has
      been uncovered, leading to incorrect object code that discards part of a
      function and causes undefined behavior.
      
      As part of this problem is how __builtin_constant_p gets evaluated on an
      argument passed by reference into an inline function, this avoids the
      use of __builtin_constant_p() for all architectures that set
      CONFIG_ARCH_USE_BUILTIN_BSWAP.  Most architectures do not set
      ARCH_SUPPORTS_OPTIMIZED_INLINING, which means they probably do not
      suffer from the problem in the qla2xxx driver, but they might still run
      into it elsewhere.
      
      Both of the original workarounds were only merged in the 4.6 kernel, and
      the bug that is fixed by this patch should only appear if both are
      there, so we probably don't need to backport the fix.  On the other
      hand, it works by simplifying the code path and should not have any
      negative effects.
      
      [arnd@arndb.de: fix older gcc warnings]
        (http://lkml.kernel.org/r/12243652.bxSxEgjgfk@wuerfel)
      Link: https://lkml.org/lkml/headers/2016/4/12/1103
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70232
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70646
      Fixes: e3bde956 ("include/linux/unaligned: force inlining of byteswap operations")
      Fixes: ef3fb242 ("scsi: fc: use get/put_unaligned64 for wwn access")
      Link: http://lkml.kernel.org/r/1780465.XdtPJpi8Tt@wuerfelSigned-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Tested-by: Josh Poimboeuf <jpoimboe@redhat.com> # on gcc-5.3
      Tested-by: NQuinn Tran <quinn.tran@qlogic.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
      Cc: Jan Hubicka <hubicka@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7322dd75
    • A
      mm: thp: kvm: fix memory corruption in KVM with THP enabled · 127393fb
      Andrea Arcangeli 提交于
      After the THP refcounting change, obtaining a compound pages from
      get_user_pages() no longer allows us to assume the entire compound page
      is immediately mappable from a secondary MMU.
      
      A secondary MMU doesn't want to call get_user_pages() more than once for
      each compound page, in order to know if it can map the whole compound
      page.  So a secondary MMU needs to know from a single get_user_pages()
      invocation when it can map immediately the entire compound page to avoid
      a flood of unnecessary secondary MMU faults and spurious
      atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
      users).
      
      Ideally instead of the page->_mapcount < 1 check, get_user_pages()
      should return the granularity of the "page" mapping in the "mm" passed
      to get_user_pages().  However it's non trivial change to pass the "pmd"
      status belonging to the "mm" walked by get_user_pages up the stack (up
      to the caller of get_user_pages).  So the fix just checks if there is
      not a single pte mapping on the page returned by get_user_pages, and in
      turn if the caller can assume that the whole compound page is mapped in
      the current "mm" (in a pmd_trans_huge()).  In such case the entire
      compound page is safe to map into the secondary MMU without additional
      get_user_pages() calls on the surrounding tail/head pages.  In addition
      of being faster, not having to run other get_user_pages() calls also
      reduces the memory footprint of the secondary MMU fault in case the pmd
      split happened as result of memory pressure.
      
      Without this fix after a MADV_DONTNEED (like invoked by QEMU during
      postcopy live migration or balloning) or after generic swapping (with a
      failure in split_huge_page() that would only result in pmd splitting and
      not a physical page split), KVM would map the whole compound page into
      the shadow pagetables, despite regular faults or userfaults (like
      UFFDIO_COPY) may map regular pages into the primary MMU as result of the
      pte faults, leading to the guest mode and userland mode going out of
      sync and not working on the same memory at all times.
      
      Any other secondary MMU notifier manager (KVM is just one of the many
      MMU notifier users) will need the same information if it doesn't want to
      run a flood of get_user_pages_fast and it can support multiple
      granularity in the secondary MMU mappings, so I think it is justified to
      be exposed not just to KVM.
      
      The other option would be to move transparent_hugepage_adjust to
      mm/huge_memory.c but that currently has all kind of KVM data structures
      in it, so it's definitely not a cut-and-paste work, so I couldn't do a
      fix as cleaner as this one for 4.6.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Li, Liang Z" <liang.z.li@intel.com>
      Cc: Amit Shah <amit.shah@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      127393fb
    • A
      rapidio/mport_cdev: fix uapi type definitions · 4e1016da
      Alexandre Bounine 提交于
      Fix problems in uapi definitions reported by Gabriel Laskar: (see
      https://lkml.org/lkml/2016/4/5/205 for details)
      
       - move public header file rio_mport_cdev.h to include/uapi/linux directory
       - change types in data structures passed as IOCTL parameters
       - improve parameter checking in some IOCTL service routines
      Signed-off-by: NAlexandre Bounine <alexandre.bounine@idt.com>
      Reported-by: NGabriel Laskar <gabriel@lse.epita.fr>
      Tested-by: NBarry Wood <barry.wood@idt.com>
      Cc: Gabriel Laskar <gabriel@lse.epita.fr>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
      Cc: Barry Wood <barry.wood@idt.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e1016da
    • J
      mm: memcontrol: let v2 cgroups follow changes in system swappiness · 4550c4e1
      Johannes Weiner 提交于
      Cgroup2 currently doesn't have a per-cgroup swappiness setting.  We
      might want to add one later - that's a different discussion - but until
      we do, the cgroups should always follow the system setting.  Otherwise
      it will be unchangeably set to whatever the ancestor inherited from the
      system setting at the time of cgroup creation.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4550c4e1
  2. 05 5月, 2016 1 次提交
  3. 04 5月, 2016 1 次提交
  4. 03 5月, 2016 1 次提交
    • L
      Minimal fix-up of bad hashing behavior of hash_64() · 689de1d6
      Linus Torvalds 提交于
      This is a fairly minimal fixup to the horribly bad behavior of hash_64()
      with certain input patterns.
      
      In particular, because the multiplicative value used for the 64-bit hash
      was intentionally bit-sparse (so that the multiply could be done with
      shifts and adds on architectures without hardware multipliers), some
      bits did not get spread out very much.  In particular, certain fairly
      common bit ranges in the input (roughly bits 12-20: commonly with the
      most information in them when you hash things like byte offsets in files
      or memory that have block factors that mean that the low bits are often
      zero) would not necessarily show up much in the result.
      
      There's a bigger patch-series brewing to fix up things more completely,
      but this is the fairly minimal fix for the 64-bit hashing problem.  It
      simply picks a much better constant multiplier, spreading the bits out a
      lot better.
      
      NOTE! For 32-bit architectures, the bad old hash_64() remains the same
      for now, since 64-bit multiplies are expensive.  The bigger hashing
      cleanup will replace the 32-bit case with something better.
      
      The new constants were picked by George Spelvin who wrote that bigger
      cleanup series.  I just picked out the constants and part of the comment
      from that series.
      
      Cc: stable@vger.kernel.org
      Cc: George Spelvin <linux@horizon.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      689de1d6
  5. 02 5月, 2016 1 次提交
    • T
      net: Implement net_dbg_ratelimited() for CONFIG_DYNAMIC_DEBUG case · 2c94b537
      Tim Bingham 提交于
      Prior to commit d92cff89 ("net_dbg_ratelimited: turn into no-op
      when !DEBUG") the implementation of net_dbg_ratelimited() was buggy
      for both the DEBUG and CONFIG_DYNAMIC_DEBUG cases.
      
      The bug was that net_ratelimit() was being called and, despite
      returning true, nothing was being printed to the console. This
      resulted in messages like the following -
      
      "net_ratelimit: %d callbacks suppressed"
      
      with no other output nearby.
      
      After commit d92cff89 ("net_dbg_ratelimited: turn into no-op when
      !DEBUG") the bug is fixed for the DEBUG case. However, there's no
      output at all for CONFIG_DYNAMIC_DEBUG case.
      
      This patch restores debug output (if enabled) for the
      CONFIG_DYNAMIC_DEBUG case.
      
      Add a definition of net_dbg_ratelimited() for the CONFIG_DYNAMIC_DEBUG
      case. The implementation takes care to check that dynamic debugging is
      enabled before calling net_ratelimit().
      
      Fixes: d92cff89 ("net_dbg_ratelimited: turn into no-op when !DEBUG")
      Signed-off-by: NTim Bingham <tbingham@akamai.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c94b537
  6. 29 4月, 2016 6 次提交
    • G
      numa: fix /proc/<pid>/numa_maps for THP · 28093f9f
      Gerald Schaefer 提交于
      In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
      because the layouts may differ depending on the architecture.  On s390
      this will lead to inaccurate numa_maps accounting in /proc because of
      misguided pte_present() and pte_dirty() checks on the fake pte.
      
      On other architectures pte_present() and pte_dirty() may work by chance,
      but there may be an issue with direct-access (dax) mappings w/o
      underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
      available.  In vm_normal_page() the fake pte will be checked with
      pte_special() and because there is no "special" bit in a pmd, this will
      always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
      skipped.  On dax mappings w/o struct pages, an invalid struct page
      pointer would then be returned that can crash the kernel.
      
      This patch fixes the numa_maps THP handling by introducing new "_pmd"
      variants of the can_gather_numa_stats() and vm_normal_page() functions.
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28093f9f
    • K
      thp: keep huge zero page pinned until tlb flush · aa88b68c
      Kirill A. Shutemov 提交于
      Andrea has found[1] a race condition on MMU-gather based TLB flush vs
      split_huge_page() or shrinker which frees huge zero under us (patch 1/2
      and 2/2 respectively).
      
      With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the
      page pinned until flush is complete and the pin prevents the page from
      being split under us.
      
      We still need patch 2/2.  This is simplified version of Andrea's patch.
      We don't need fancy encoding.
      
      [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa88b68c
    • S
      mm: exclude HugeTLB pages from THP page_mapped() logic · 66ee95d1
      Steve Capper 提交于
      HugeTLB pages cannot be split, so we use the compound_mapcount to track
      rmaps.
      
      Currently page_mapped() will check the compound_mapcount, but will also
      go through the constituent pages of a THP compound page and query the
      individual _mapcount's too.
      
      Unfortunately, page_mapped() does not distinguish between HugeTLB and
      THP compound pages and assumes that a compound page always needs to have
      HPAGE_PMD_NR pages querying.
      
      For most cases when dealing with HugeTLB this is just inefficient, but
      for scenarios where the HugeTLB page size is less than the pmd block
      size (e.g.  when using contiguous bit on ARM) this can lead to crashes.
      
      This patch adjusts the page_mapped function such that we skip the
      unnecessary THP reference checks for HugeTLB pages.
      
      Fixes: e1534ae9 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
      Signed-off-by: NSteve Capper <steve.capper@arm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66ee95d1
    • A
      bpf: fix refcnt overflow · 92117d84
      Alexei Starovoitov 提交于
      On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK,
      the malicious application may overflow 32-bit bpf program refcnt.
      It's also possible to overflow map refcnt on 1Tb system.
      Impose 32k hard limit which means that the same bpf program or
      map cannot be shared by more than 32k processes.
      
      Fixes: 1be7f75d ("bpf: enable non-root eBPF programs")
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92117d84
    • M
      net: fix net_gso_ok for new GSO types. · 7b748340
      Marcelo Ricardo Leitner 提交于
      Fix casting in net_gso_ok. Otherwise the shift on
      gso_type << NETIF_F_GSO_SHIFT may hit the 32th bit and make it look like
      a INT_MIN, which is then promoted from signed to uint64 which is
      0xffffffff80000000, resulting in wrong behavior when it is and'ed with
      the feature itself, as in:
      
      This test app:
      #include <stdio.h>
      #include <stdint.h>
      
      int main(int argc, char **argv)
      {
      	uint64_t feature1;
      	uint64_t feature2;
      	int gso_type = 1 << 15;
      
      	feature1 = gso_type << 16;
      	feature2 = (uint64_t)gso_type << 16;
      	printf("%lx %lx\n", feature1, feature2);
      
      	return 0;
      }
      
      Gives:
      ffffffff80000000 80000000
      
      So that this:
         return (features & feature) == feature;
      Actually works on more bits than expected and invalid ones.
      
      Fix is to promote it earlier.
      
      Issue noted while rebasing SCTP GSO patch but posting separetely as
      someone else may experience this meanwhile.
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b748340
    • J
      IB/security: Restrict use of the write() interface · e6bd18f5
      Jason Gunthorpe 提交于
      The drivers/infiniband stack uses write() as a replacement for
      bi-directional ioctl().  This is not safe. There are ways to
      trigger write calls that result in the return structure that
      is normally written to user space being shunted off to user
      specified kernel memory instead.
      
      For the immediate repair, detect and deny suspicious accesses to
      the write API.
      
      For long term, update the user space libraries and the kernel API
      to something that doesn't present the same security vulnerabilities
      (likely a structured ioctl() interface).
      
      The impacted uAPI interfaces are generally only available if
      hardware from drivers/infiniband is installed in the system.
      Reported-by: NJann Horn <jann@thejh.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      [ Expanded check to all known write() entry points ]
      Cc: stable@vger.kernel.org
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      e6bd18f5
  7. 28 4月, 2016 14 次提交
  8. 27 4月, 2016 1 次提交
    • L
      devpts: more pty driver interface cleanups · 8ead9dd5
      Linus Torvalds 提交于
      This is more prep-work for the upcoming pty changes.  Still just code
      cleanup with no actual semantic changes.
      
      This removes a bunch pointless complexity by just having the slave pty
      side remember the dentry associated with the devpts slave rather than
      the inode.  That allows us to remove all the "look up the dentry" code
      for when we want to remove it again.
      
      Together with moving the tty pointer from "inode->i_private" to
      "dentry->d_fsdata" and getting rid of pointless inode locking, this
      removes about 30 lines of code.  Not only is the end result smaller,
      it's simpler and easier to understand.
      
      The old code, for example, depended on the d_find_alias() to not just
      find the dentry, but also to check that it is still hashed, which in
      turn validated the tty pointer in the inode.
      
      That is a _very_ roundabout way to say "invalidate the cached tty
      pointer when the dentry is removed".
      
      The new code just does
      
      	dentry->d_fsdata = NULL;
      
      in devpts_pty_kill() instead, invalidating the tty pointer rather more
      directly and obviously.  Don't do something complex and subtle when the
      obvious straightforward approach will do.
      
      The rest of the patch (ie apart from code deletion and the above tty
      pointer clearing) is just switching the calling convention to pass the
      dentry or file pointer around instead of the inode.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ead9dd5
  9. 26 4月, 2016 5 次提交
    • D
      Revert "ipv6: Revert optional address flusing on ifdown." · 6a923934
      David S. Miller 提交于
      This reverts commit 841645b5.
      
      Ok, this puts the feature back.  I've decided to apply David A.'s
      bug fix and run with that rather than make everyone wait another
      whole release for this feature.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a923934
    • T
      ALSA: hda - Update BCLK also at hotplug for i915 HSW/BDW · bb03ed21
      Takashi Iwai 提交于
      The recent bug report suggests that BCLK setup for i915 HSW/BDW needs
      to be updated at each HDMI hotplug, not only at initialization and
      resume.  That is, we need to update HSW_EM4 and HSW_EM5 registers at
      ELD notification, too.  Otherwise the HDMI audio may be out of sync
      and played in a wrong pitch.
      
      However, the HDA codec driver has no access to the controller
      registers, and currently the code managing these registers is in
      hda_intel.c, i.e. local to the controller driver.  For allowing the
      explicit BCLK update from the codec driver, as in this patch, the
      former haswell_set_bclk() in hda_intel.c is moved to hdac_i915.c and
      exposed as snd_hdac_i915_set_bclk().  This is called from both the HDA
      controller driver and intel_pin_eld_notify() in HDMI codec driver.
      
      Along with this change, snd_hdac_get_display_clk() gets dropped as
      it's no longer used.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=91410
      Cc: <stable@vger.kernel.org> # v4.5+
      Signed-off-by: NTakashi Iwai <tiwai@suse.de>
      bb03ed21
    • T
      cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Tejun Heo 提交于
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutex in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
    • D
      ipv6: Revert optional address flusing on ifdown. · 841645b5
      David S. Miller 提交于
      This reverts the following three commits:
      
      70af921d
      799977d9
      f1705ec1
      
      The feature was ill conceived, has terrible semantics, and has added
      nothing but regressions to the already fragile ipv6 stack.
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      841645b5
    • I
      libceph: make authorizer destruction independent of ceph_auth_client · 6c1ea260
      Ilya Dryomov 提交于
      Starting the kernel client with cephx disabled and then enabling cephx
      and restarting userspace daemons can result in a crash:
      
          [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
          [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262671.584334] PGD 0
          [262671.635847] Oops: 0000 [#1] SMP
          [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
          [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
          [262672.268937] Workqueue: ceph-msgr con_work [libceph]
          [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
          [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
          [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
          [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
          [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
          [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
          [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
          [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
          [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
          [262673.417556] Stack:
          [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
          [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
          [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
          [262673.805230] Call Trace:
          [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
          [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
          [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
          [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
          [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
          [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
          [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
          [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
          [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
          [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
          [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
          [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
          [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
          [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
          [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
          [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
          [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
          [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
      
      What happens is the following:
      
          (1) new MON session is established
          (2) old "none" ac is destroyed
          (3) new "cephx" ac is constructed
          ...
          (4) old OSD session (w/ "none" authorizer) is put
                ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)
      
      osd->o_auth.authorizer in the "none" case is just a bare pointer into
      ac, which contains a single static copy for all services.  By the time
      we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
      a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
      so we end up trying to destroy a "none" authorizer with a "cephx"
      destructor operating on invalid memory!
      
      To fix this, decouple authorizer destruction from ac and do away with
      a single static "none" authorizer by making a copy for each OSD or MDS
      session.  Authorizers themselves are independent of ac and so there is
      no reason for destroy_authorizer() to be an ac op.  Make it an op on
      the authorizer itself by turning ceph_authorizer into a real struct.
      
      Fixes: http://tracker.ceph.com/issues/15447Reported-by: NAlan Zhang <alan.zhang@linux.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      6c1ea260
  10. 25 4月, 2016 6 次提交