1. 29 4月, 2016 4 次提交
    • G
      numa: fix /proc/<pid>/numa_maps for THP · 28093f9f
      Gerald Schaefer 提交于
      In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
      because the layouts may differ depending on the architecture.  On s390
      this will lead to inaccurate numa_maps accounting in /proc because of
      misguided pte_present() and pte_dirty() checks on the fake pte.
      
      On other architectures pte_present() and pte_dirty() may work by chance,
      but there may be an issue with direct-access (dax) mappings w/o
      underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
      available.  In vm_normal_page() the fake pte will be checked with
      pte_special() and because there is no "special" bit in a pmd, this will
      always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
      skipped.  On dax mappings w/o struct pages, an invalid struct page
      pointer would then be returned that can crash the kernel.
      
      This patch fixes the numa_maps THP handling by introducing new "_pmd"
      variants of the can_gather_numa_stats() and vm_normal_page() functions.
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28093f9f
    • K
      thp: keep huge zero page pinned until tlb flush · aa88b68c
      Kirill A. Shutemov 提交于
      Andrea has found[1] a race condition on MMU-gather based TLB flush vs
      split_huge_page() or shrinker which frees huge zero under us (patch 1/2
      and 2/2 respectively).
      
      With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the
      page pinned until flush is complete and the pin prevents the page from
      being split under us.
      
      We still need patch 2/2.  This is simplified version of Andrea's patch.
      We don't need fancy encoding.
      
      [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa88b68c
    • S
      mm: exclude HugeTLB pages from THP page_mapped() logic · 66ee95d1
      Steve Capper 提交于
      HugeTLB pages cannot be split, so we use the compound_mapcount to track
      rmaps.
      
      Currently page_mapped() will check the compound_mapcount, but will also
      go through the constituent pages of a THP compound page and query the
      individual _mapcount's too.
      
      Unfortunately, page_mapped() does not distinguish between HugeTLB and
      THP compound pages and assumes that a compound page always needs to have
      HPAGE_PMD_NR pages querying.
      
      For most cases when dealing with HugeTLB this is just inefficient, but
      for scenarios where the HugeTLB page size is less than the pmd block
      size (e.g.  when using contiguous bit on ARM) this can lead to crashes.
      
      This patch adjusts the page_mapped function such that we skip the
      unnecessary THP reference checks for HugeTLB pages.
      
      Fixes: e1534ae9 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
      Signed-off-by: NSteve Capper <steve.capper@arm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66ee95d1
    • J
      IB/security: Restrict use of the write() interface · e6bd18f5
      Jason Gunthorpe 提交于
      The drivers/infiniband stack uses write() as a replacement for
      bi-directional ioctl().  This is not safe. There are ways to
      trigger write calls that result in the return structure that
      is normally written to user space being shunted off to user
      specified kernel memory instead.
      
      For the immediate repair, detect and deny suspicious accesses to
      the write API.
      
      For long term, update the user space libraries and the kernel API
      to something that doesn't present the same security vulnerabilities
      (likely a structured ioctl() interface).
      
      The impacted uAPI interfaces are generally only available if
      hardware from drivers/infiniband is installed in the system.
      Reported-by: NJann Horn <jann@thejh.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      [ Expanded check to all known write() entry points ]
      Cc: stable@vger.kernel.org
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      e6bd18f5
  2. 28 4月, 2016 1 次提交
  3. 27 4月, 2016 1 次提交
    • L
      devpts: more pty driver interface cleanups · 8ead9dd5
      Linus Torvalds 提交于
      This is more prep-work for the upcoming pty changes.  Still just code
      cleanup with no actual semantic changes.
      
      This removes a bunch pointless complexity by just having the slave pty
      side remember the dentry associated with the devpts slave rather than
      the inode.  That allows us to remove all the "look up the dentry" code
      for when we want to remove it again.
      
      Together with moving the tty pointer from "inode->i_private" to
      "dentry->d_fsdata" and getting rid of pointless inode locking, this
      removes about 30 lines of code.  Not only is the end result smaller,
      it's simpler and easier to understand.
      
      The old code, for example, depended on the d_find_alias() to not just
      find the dentry, but also to check that it is still hashed, which in
      turn validated the tty pointer in the inode.
      
      That is a _very_ roundabout way to say "invalidate the cached tty
      pointer when the dentry is removed".
      
      The new code just does
      
      	dentry->d_fsdata = NULL;
      
      in devpts_pty_kill() instead, invalidating the tty pointer rather more
      directly and obviously.  Don't do something complex and subtle when the
      obvious straightforward approach will do.
      
      The rest of the patch (ie apart from code deletion and the above tty
      pointer clearing) is just switching the calling convention to pass the
      dentry or file pointer around instead of the inode.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ead9dd5
  4. 26 4月, 2016 5 次提交
    • D
      Revert "ipv6: Revert optional address flusing on ifdown." · 6a923934
      David S. Miller 提交于
      This reverts commit 841645b5.
      
      Ok, this puts the feature back.  I've decided to apply David A.'s
      bug fix and run with that rather than make everyone wait another
      whole release for this feature.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a923934
    • T
      ALSA: hda - Update BCLK also at hotplug for i915 HSW/BDW · bb03ed21
      Takashi Iwai 提交于
      The recent bug report suggests that BCLK setup for i915 HSW/BDW needs
      to be updated at each HDMI hotplug, not only at initialization and
      resume.  That is, we need to update HSW_EM4 and HSW_EM5 registers at
      ELD notification, too.  Otherwise the HDMI audio may be out of sync
      and played in a wrong pitch.
      
      However, the HDA codec driver has no access to the controller
      registers, and currently the code managing these registers is in
      hda_intel.c, i.e. local to the controller driver.  For allowing the
      explicit BCLK update from the codec driver, as in this patch, the
      former haswell_set_bclk() in hda_intel.c is moved to hdac_i915.c and
      exposed as snd_hdac_i915_set_bclk().  This is called from both the HDA
      controller driver and intel_pin_eld_notify() in HDMI codec driver.
      
      Along with this change, snd_hdac_get_display_clk() gets dropped as
      it's no longer used.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=91410
      Cc: <stable@vger.kernel.org> # v4.5+
      Signed-off-by: NTakashi Iwai <tiwai@suse.de>
      bb03ed21
    • T
      cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Tejun Heo 提交于
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutex in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
    • D
      ipv6: Revert optional address flusing on ifdown. · 841645b5
      David S. Miller 提交于
      This reverts the following three commits:
      
      70af921d
      799977d9
      f1705ec1
      
      The feature was ill conceived, has terrible semantics, and has added
      nothing but regressions to the already fragile ipv6 stack.
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      841645b5
    • I
      libceph: make authorizer destruction independent of ceph_auth_client · 6c1ea260
      Ilya Dryomov 提交于
      Starting the kernel client with cephx disabled and then enabling cephx
      and restarting userspace daemons can result in a crash:
      
          [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
          [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262671.584334] PGD 0
          [262671.635847] Oops: 0000 [#1] SMP
          [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
          [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
          [262672.268937] Workqueue: ceph-msgr con_work [libceph]
          [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
          [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
          [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
          [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
          [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
          [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
          [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
          [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
          [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
          [262673.417556] Stack:
          [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
          [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
          [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
          [262673.805230] Call Trace:
          [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
          [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
          [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
          [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
          [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
          [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
          [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
          [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
          [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
          [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
          [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
          [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
          [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
          [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
          [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
          [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
          [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
          [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
      
      What happens is the following:
      
          (1) new MON session is established
          (2) old "none" ac is destroyed
          (3) new "cephx" ac is constructed
          ...
          (4) old OSD session (w/ "none" authorizer) is put
                ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)
      
      osd->o_auth.authorizer in the "none" case is just a bare pointer into
      ac, which contains a single static copy for all services.  By the time
      we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
      a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
      so we end up trying to destroy a "none" authorizer with a "cephx"
      destructor operating on invalid memory!
      
      To fix this, decouple authorizer destruction from ac and do away with
      a single static "none" authorizer by making a copy for each OSD or MDS
      session.  Authorizers themselves are independent of ac and so there is
      no reason for destroy_authorizer() to be an ac op.  Make it an op on
      the authorizer itself by turning ceph_authorizer into a real struct.
      
      Fixes: http://tracker.ceph.com/issues/15447Reported-by: NAlan Zhang <alan.zhang@linux.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      6c1ea260
  5. 25 4月, 2016 7 次提交
  6. 24 4月, 2016 1 次提交
  7. 23 4月, 2016 1 次提交
    • P
      lockdep: Fix lock_chain::base size · 75dd602a
      Peter Zijlstra 提交于
      lock_chain::base is used to store an index into the chain_hlocks[]
      array, however that array contains more elements than can be indexed
      using the u16.
      
      Change the lock_chain structure to use a bitfield to encode the data
      it needs and add BUILD_BUG_ON() assertions to check the fields are
      wide enough.
      
      Also, for DEBUG_LOCKDEP, assert that we don't run out of elements of
      that array; as that would wreck the collision detectoring.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alfredo Alvarez Fernandez <alfredoalvarezfernandez@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160330093659.GS3408@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      75dd602a
  8. 22 4月, 2016 3 次提交
  9. 21 4月, 2016 3 次提交
    • T
      ALSA: hda - Fix possible race on regmap bypass flip · 3194ed49
      Takashi Iwai 提交于
      HD-audio driver uses regmap cache bypass feature for reading a raw
      value without the cache.  But this is racy since both the cached and
      the uncached reads may occur concurrently.  The former is done via the
      normal control API access while the latter comes from the proc file
      read.
      
      Even though the regmap itself has the protection against the
      concurrent accesses, the flag set/reset is done without the
      protection, so it may lead to inconsistent state of bypass flag that
      doesn't match with the current read and occasionally result in a
      kernel WARNING like:
        WARNING: CPU: 3 PID: 2731 at drivers/base/regmap/regcache.c:499 regcache_cache_only+0x78/0x93
      
      One way to work around such a problem is to wrap with a mutex.  But in
      this case, the solution is simpler: for the uncached read, we just
      skip the regmap and directly calls its accessor.  The verb execution
      there is protected by itself, so basically it's safe to call
      individually.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=116171Signed-off-by: NTakashi Iwai <tiwai@suse.de>
      3194ed49
    • R
      asm-generic/futex: Re-enable preemption in futex_atomic_cmpxchg_inatomic() · fba7cd68
      Romain Perier 提交于
      The recent decoupling of pagefault disable and preempt disable added an
      explicit preempt_disable/enable() pair to the futex_atomic_cmpxchg_inatomic()
      implementation in asm-generic/futex.h. But it forgot to add preempt_enable()
      calls to the error handling code pathes, which results in a preemption count
      imbalance.
      
      This is observable on boot when the test for atomic_cmpxchg() is calling
      futex_atomic_cmpxchg_inatomic() on a NULL pointer.
      
      Add the missing preempt_enable() calls to the error handling code pathes.
      
      [ tglx: Massaged changelog ]
      
      Fixes: d9b9ff8c ("sched/preempt, futex: Disable preemption in UP futex_atomic_cmpxchg_inatomic() explicitly")
      Signed-off-by: NRomain Perier <romain.perier@free-electrons.com>
      Cc: linux-arch@vger.kernel.org
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1460640963-690-1-git-send-email-romain.perier@free-electrons.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      fba7cd68
    • W
      thermal: consistently use int for trip temp · 1d0fd42f
      Wei Ni 提交于
      The commit 17e8351a consistently use int for temperature,
      however it missed a few in trip temperature and thermal_core.
      
      In current codes, the trip->temperature used "unsigned long"
      and zone->temperature used"int", if the temperature is negative
      value, it will get wrong result when compare temperature with
      trip temperature.
      
      This patch can fix it.
      Signed-off-by: NWei Ni <wni@nvidia.com>
      Signed-off-by: NEduardo Valentin <edubezval@gmail.com>
      1d0fd42f
  10. 20 4月, 2016 1 次提交
  11. 19 4月, 2016 1 次提交
    • L
      devpts: clean up interface to pty drivers · 67245ff3
      Linus Torvalds 提交于
      This gets rid of the horrible notion of having that
      
          struct inode *ptmx_inode
      
      be the linchpin of the interface between the pty code and devpts.
      
      By de-emphasizing the ptmx inode, a lot of things actually get cleaner,
      and we will have a much saner way forward.  In particular, this will
      allow us to associate with any particular devpts instance at open-time,
      and not be artificially tied to one particular ptmx inode.
      
      The patch itself is actually fairly straightforward, and apart from some
      locking and return path cleanups it's pretty mechanical:
      
       - the interfaces that devpts exposes all take "struct pts_fs_info *"
         instead of "struct inode *ptmx_inode" now.
      
         NOTE! The "struct pts_fs_info" thing is a completely opaque structure
         as far as the pty driver is concerned: it's still declared entirely
         internally to devpts. So the pty code can't actually access it in any
         way, just pass it as a "cookie" to the devpts code.
      
       - the "look up the pts fs info" is now a single clear operation, that
         also does the reference count increment on the pts superblock.
      
         So "devpts_add/del_ref()" is gone, and replaced by a "lookup and get
         ref" operation (devpts_get_ref(inode)), along with a "put ref" op
         (devpts_put_ref()).
      
       - the pty master "tty->driver_data" field now contains the pts_fs_info,
         not the ptmx inode.
      
       - because we don't care about the ptmx inode any more as some kind of
         base index, the ref counting can now drop the inode games - it just
         gets the ref on the superblock.
      
       - the pts_fs_info now has a back-pointer to the super_block. That's so
         that we can easily look up the information we actually need. Although
         quite often, the pts fs info was actually all we wanted, and not having
         to look it up based on some magical inode makes things more
         straightforward.
      
      In particular, now that "devpts_get_ref(inode)" operation should really
      be the *only* place we need to look up what devpts instance we're
      associated with, and we do it exactly once, at ptmx_open() time.
      
      The other side of this is that one ptmx node could now be associated
      with multiple different devpts instances - you could have a single
      /dev/ptmx node, and then have multiple mount namespaces with their own
      instances of devpts mounted on /dev/pts/.  And that's all perfectly sane
      in a model where we just look up the pts instance at open time.
      
      This will eventually allow us to get rid of our odd single-vs-multiple
      pts instance model, but this patch in itself changes no semantics, only
      an internal binding model.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      67245ff3
  12. 16 4月, 2016 1 次提交
  13. 15 4月, 2016 4 次提交
    • C
      soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek 提交于
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d894ba18
    • M
      ipv6: udp: Do a route lookup and update during release_cb · e646b657
      Martin KaFai Lau 提交于
      This patch adds a release_cb for UDPv6.  It does a route lookup
      and updates sk->sk_dst_cache if it is needed.  It picks up the
      left-over job from ip6_sk_update_pmtu() if the sk was owned
      by user during the pmtu update.
      
      It takes a rcu_read_lock to protect the __sk_dst_get() operations
      because another thread may do ip6_dst_store() without taking the
      sk lock (e.g. sendmsg).
      
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reported-by: NWei Wang <weiwan@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e646b657
    • M
      ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update · 33c162a9
      Martin KaFai Lau 提交于
      There is a case in connected UDP socket such that
      getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
      sequence could be the following:
      1. Create a connected UDP socket
      2. Send some datagrams out
      3. Receive a ICMPV6_PKT_TOOBIG
      4. No new outgoing datagrams to trigger the sk_dst_check()
         logic to update the sk->sk_dst_cache.
      5. getsockopt(IPV6_MTU) returns the mtu from the invalid
         sk->sk_dst_cache instead of the newly created RTF_CACHE clone.
      
      This patch updates the sk->sk_dst_cache for a connected datagram sk
      during pmtu-update code path.
      
      Note that the sk->sk_v6_daddr is used to do the route lookup
      instead of skb->data (i.e. iph).  It is because a UDP socket can become
      connected after sending out some datagrams in un-connected state.  or
      It can be connected multiple times to different destinations.  Hence,
      iph may not be related to where sk is currently connected to.
      
      It is done under '!sock_owned_by_user(sk)' condition because
      the user may make another ip6_datagram_connect()  (i.e changing
      the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
      code path.
      
      For the sock_owned_by_user(sk) == true case, the next patch will
      introduce a release_cb() which will update the sk->sk_dst_cache.
      
      Test:
      
      Server (Connected UDP Socket):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Route Details:
      [root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
      2fac::/64 dev eth0  proto kernel  metric 256  pref medium
      2fac:face::/64 via 2fac::face dev eth0  metric 1024  pref medium
      
      A simple python code to create a connected UDP socket:
      
      import socket
      import errno
      
      HOST = '2fac::1'
      PORT = 8080
      
      s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
      s.bind((HOST, PORT))
      s.connect(('2fac:face::face', 53))
      print("connected")
      while True:
          try:
      	data = s.recv(1024)
          except socket.error as se:
      	if se.errno == errno.EMSGSIZE:
      		pmtu = s.getsockopt(41, 24)
      		print("PMTU:%d" % pmtu)
      		break
      s.close()
      
      Python program output after getting a ICMPV6_PKT_TOOBIG:
      [root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
      connected
      PMTU:1300
      
      Cache routes after recieving TOOBIG:
      [root@arch-fb-vm1 ~]# ip -6 r show table cache
      2fac:face::face via 2fac::face dev eth0  metric 0
          cache  expires 463sec mtu 1300 pref medium
      
      Client (Send the ICMPV6_PKT_TOOBIG):
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      scapy is used to generate the TOOBIG message.  Here is the scapy script I have
      used:
      
      >>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
      1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
      >>> sendp(p, iface='qemubr0')
      
      Fixes: 45e4fd26 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reported-by: NWei Wang <weiwan@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33c162a9
    • L
      Make file credentials available to the seqfile interfaces · 34dbbcdb
      Linus Torvalds 提交于
      A lot of seqfile users seem to be using things like %pK that uses the
      credentials of the current process, but that is actually completely
      wrong for filesystem interfaces.
      
      The unix semantics for permission checking files is to check permissions
      at _open_ time, not at read or write time, and that is not just a small
      detail: passing off stdin/stdout/stderr to a suid application and making
      the actual IO happen in privileged context is a classic exploit
      technique.
      
      So if we want to be able to look at permissions at read time, we need to
      use the file open credentials, not the current ones.  Normal file
      accesses can just use "f_cred" (or any of the helper functions that do
      that, like file_ns_capable()), but the seqfile interfaces do not have
      any such options.
      
      It turns out that seq_file _does_ save away the user_ns information of
      the file, though.  Since user_ns is just part of the full credential
      information, replace that special case with saving off the cred pointer
      instead, and suddenly seq_file has all the permission information it
      needs.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34dbbcdb
  14. 14 4月, 2016 3 次提交
  15. 13 4月, 2016 1 次提交
  16. 12 4月, 2016 2 次提交
  17. 11 4月, 2016 1 次提交
    • M
      sctp: avoid refreshing heartbeat timer too often · ba6f5e33
      Marcelo Ricardo Leitner 提交于
      Currently on high rate SCTP streams the heartbeat timer refresh can
      consume quite a lot of resources as timer updates are costly and it
      contains a random factor, which a) is also costly and b) invalidates
      mod_timer() optimization for not editing a timer to the same value.
      It may even cause the timer to be slightly advanced, for no good reason.
      
      As suggested by David Laight this patch now removes this timer update
      from hot path by leaving the timer on and re-evaluating upon its
      expiration if the heartbeat is still needed or not, similarly to what is
      done for TCP. If it's not needed anymore the timer is re-scheduled to
      the new timeout, considering the time already elapsed.
      
      For this, we now record the last tx timestamp per transport, updated in
      the same spots as hb timer was restarted on tx. Also split up
      sctp_transport_reset_timers into sctp_transport_reset_t3_rtx and
      sctp_transport_reset_hb_timer, so we can re-arm T3 without re-arming the
      heartbeat one.
      
      On loopback with MTU of 65535 and data chunks with 1636, so that we
      have a considerable amount of chunks without stressing system calls,
      netperf -t SCTP_STREAM -l 30, perf looked like this before:
      
      Samples: 103K of event 'cpu-clock', Event count (approx.): 25833000000
        Overhead  Command  Shared Object      Symbol
      +    6,15%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
      -    5,43%  netperf  [kernel.vmlinux]   [k] _raw_write_unlock_irqrestore
         - _raw_write_unlock_irqrestore
            - 96,54% _raw_spin_unlock_irqrestore
               - 36,14% mod_timer
                  + 97,24% sctp_transport_reset_timers
                  + 2,76% sctp_do_sm
               + 33,65% __wake_up_sync_key
               + 28,77% sctp_ulpq_tail_event
               + 1,40% del_timer
            - 1,84% mod_timer
               + 99,03% sctp_transport_reset_timers
               + 0,97% sctp_do_sm
            + 1,50% sctp_ulpq_tail_event
      
      And after this patch, now with netperf -l 60:
      
      Samples: 230K of event 'cpu-clock', Event count (approx.): 57707250000
        Overhead  Command  Shared Object      Symbol
      +    5,65%  netperf  [kernel.vmlinux]   [k] memcpy_erms
      +    5,59%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
      -    5,05%  netperf  [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
         - _raw_spin_unlock_irqrestore
            + 49,89% __wake_up_sync_key
            + 45,68% sctp_ulpq_tail_event
            - 2,85% mod_timer
               + 76,51% sctp_transport_reset_t3_rtx
               + 23,49% sctp_do_sm
            + 1,55% del_timer
      +    2,50%  netperf  [sctp]             [k] sctp_datamsg_from_user
      +    2,26%  netperf  [sctp]             [k] sctp_sendmsg
      
      Throughput-wise, from 6800mbps without the patch to 7050mbps with it,
      ~3.7%.
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba6f5e33