1. 28 Jan 2014, 8 commits
  2. 26 Jan 2014, 1 commit
    • libceph: add ceph_kv{malloc,free}() and switch to them · eeb0bed5
      Committed by Ilya Dryomov
      Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
      two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.
      
      ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
      vmalloc()'ed with __GFP_HIGHMEM set.  This changes the existing
      behaviour:
      
      - for buffers (ceph_buffer_new()), from trying to kmalloc() everything
        and using vmalloc() just as a fallback
      
      - for messages (ceph_msg_new()), from going to vmalloc() for anything
        bigger than a page
      
      - for messages (ceph_msg_new()), from disallowing vmalloc() to use high
        memory
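
      A minimal sketch of the two helpers, assuming the 8-page cutoff is
      expressed as PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER; the exact bodies in
      net/ceph/ceph_common.c may differ:

          void *ceph_kvmalloc(size_t size, gfp_t flags)
          {
                  if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                          void *ptr = kmalloc(size, flags | __GFP_NOWARN);

                          if (ptr)
                                  return ptr;
                  }
                  /* anything bigger goes to vmalloc, highmem allowed */
                  return __vmalloc(size, flags | __GFP_HIGHMEM, PAGE_KERNEL);
          }

          void ceph_kvfree(const void *ptr)
          {
                  if (is_vmalloc_addr(ptr))
                          vfree(ptr);
                  else
                          kfree(ptr);
          }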
      Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
      Reviewed-by: Sage Weil <sage@inktank.com>
  3. 21 Jan 2014, 3 commits
  4. 14 Jan 2014, 1 commit
  5. 01 Jan 2014, 13 commits
  6. 14 Dec 2013, 1 commit
    • libceph: block I/O when PAUSE or FULL osd map flags are set · d29adb34
      Committed by Josh Durgin
      The PAUSEWR and PAUSERD flags are meant to stop the cluster from
      processing writes and reads, respectively. The FULL flag is set when
      the cluster determines that it is out of space, and will no longer
      process writes.  PAUSEWR and PAUSERD are purely client-side settings
      already implemented in userspace clients. The osd does nothing special
      with these flags.
      
      When the FULL flag is set, however, the osd responds to all writes
      with -ENOSPC. For cephfs, this makes sense, but for rbd the block
      layer translates this into EIO.  If a cluster goes from full to
      non-full quickly, a filesystem on top of rbd will not behave well,
      since some writes succeed while others get EIO.
      
      Fix this by blocking any writes when the FULL flag is set in the osd
      client. This is the same strategy used by userspace, so apply it by
      default.  A follow-on patch makes this configurable.
      
      __map_request() is called to re-target osd requests in case the
      available osds changed.  Add a paused field to a ceph_osd_request, and
      set it whenever an appropriate osd map flag is set.  Avoid queueing
      paused requests in __map_request(), but force them to be resent if
      they become unpaused.
      
      Also subscribe to the next osd map from the monitor if any of these
      flags are set, so paused requests can be unblocked as soon as
      possible.
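
      A hedged sketch of the pause test this describes; the helper name and
      the exact shape of the code in net/ceph/osd_client.c are assumptions:

          static bool __req_should_be_paused(struct ceph_osd_client *osdc,
                                             struct ceph_osd_request *req)
          {
                  bool pauserd = osdc->osdmap->flags & CEPH_OSDMAP_PAUSERD;
                  bool pausewr = osdc->osdmap->flags &
                                 (CEPH_OSDMAP_PAUSEWR | CEPH_OSDMAP_FULL);

                  /* reads pause on PAUSERD; writes pause on PAUSEWR or FULL */
                  return (req->r_flags & CEPH_OSD_FLAG_READ && pauserd) ||
                         (req->r_flags & CEPH_OSD_FLAG_WRITE && pausewr);
          }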
      
      Fixes: http://tracker.ceph.com/issues/6079
      Reviewed-by: Sage Weil <sage@inktank.com>
      Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
  7. 03 Dec 2013, 1 commit
  8. 29 Nov 2013, 1 commit
    • efivars, efi-pstore: Hold off deletion of sysfs entry until the scan is completed · e0d59733
      Committed by Seiji Aguchi
      Currently, when mounting the pstore file system, the read callback of
      the efi_pstore driver runs multiple times, as follows.
      
      - In the first read callback, scan efivar_sysfs_list from the head and
        pass the kmsg buffer of an entry to the upper pstore layer.
      - In the second read callback, rescan efivar_sysfs_list from that entry
        and pass another kmsg buffer up.
      - Repeat the scan-and-pass cycle until the end of efivar_sysfs_list.
      
      In this process, an entry is read across multiple read calls. To avoid
      a race between reading and erasing, the whole process above is
      protected by a spinlock, held in open() and released in close().
      
      During this window, kmemdup() is called to pass the buffer to the
      pstore filesystem, which triggers the lockdep warning shown below.
      
      To make the dynamic memory allocation safe without taking the
      spinlock, hold off the deletion of a sysfs entry if it happens while
      the entry is being scanned via efi_pstore, and delete it after the
      scan is completed.
      
      To implement this, the patch introduces two flags, scanning and
      deleting, in struct efivar_entry.
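
      An illustrative placement of the two flags; the surrounding fields are
      simplified from the struct efivar_entry definition of that era:

          struct efivar_entry {
                  struct efi_variable var;
                  struct list_head list;
                  struct kobject kobj;
                  bool scanning;  /* set while efi-pstore iterates the entry */
                  bool deleting;  /* deletion requested mid-scan, deferred */
          };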
      
      On a first reading of the code, the scanning and deleting logic may
      seem unnecessary, because __efivars->lock is not dropped while reading
      from the EFI variable store.
      
      But the logic is still needed, because efi-pstore and the pstore
      filesystem interact as follows.
      
      When an entry (A) is found, its pointer is saved to psi->data, and
      efi_pstore_read() passes entry (A) to the pstore filesystem by
      releasing __efivars->lock.
      
      The pstore filesystem then calls efi_pstore_read() again, and the same
      entry (A), saved in psi->data, is used to resume scanning the sysfs
      list.
      
      So the logic is needed to protect entry (A) across these calls.
      
      [    1.143710] ------------[ cut here ]------------
      [    1.144058] WARNING: CPU: 1 PID: 1 at kernel/lockdep.c:2740 lockdep_trace_alloc+0x104/0x110()
      [    1.144058] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
      [    1.144058] Modules linked in:
      [    1.144058] CPU: 1 PID: 1 Comm: systemd Not tainted 3.11.0-rc5 #2
      [    1.144058]  0000000000000009 ffff8800797e9ae0 ffffffff816614a5 ffff8800797e9b28
      [    1.144058]  ffff8800797e9b18 ffffffff8105510d 0000000000000080 0000000000000046
      [    1.144058]  00000000000000d0 00000000000003af ffffffff81ccd0c0 ffff8800797e9b78
      [    1.144058] Call Trace:
      [    1.144058]  [<ffffffff816614a5>] dump_stack+0x54/0x74
      [    1.144058]  [<ffffffff8105510d>] warn_slowpath_common+0x7d/0xa0
      [    1.144058]  [<ffffffff8105517c>] warn_slowpath_fmt+0x4c/0x50
      [    1.144058]  [<ffffffff8131290f>] ? vsscanf+0x57f/0x7b0
      [    1.144058]  [<ffffffff810bbd74>] lockdep_trace_alloc+0x104/0x110
      [    1.144058]  [<ffffffff81192da0>] __kmalloc_track_caller+0x50/0x280
      [    1.144058]  [<ffffffff815147bb>] ? efi_pstore_read_func.part.1+0x12b/0x170
      [    1.144058]  [<ffffffff8115b260>] kmemdup+0x20/0x50
      [    1.144058]  [<ffffffff815147bb>] efi_pstore_read_func.part.1+0x12b/0x170
      [    1.144058]  [<ffffffff81514800>] ? efi_pstore_read_func.part.1+0x170/0x170
      [    1.144058]  [<ffffffff815148b4>] efi_pstore_read_func+0xb4/0xe0
      [    1.144058]  [<ffffffff81512b7b>] __efivar_entry_iter+0xfb/0x120
      [    1.144058]  [<ffffffff8151428f>] efi_pstore_read+0x3f/0x50
      [    1.144058]  [<ffffffff8128d7ba>] pstore_get_records+0x9a/0x150
      [    1.158207]  [<ffffffff812af25c>] ? selinux_d_instantiate+0x1c/0x20
      [    1.158207]  [<ffffffff8128ce30>] ? parse_options+0x80/0x80
      [    1.158207]  [<ffffffff8128ced5>] pstore_fill_super+0xa5/0xc0
      [    1.158207]  [<ffffffff811ae7d2>] mount_single+0xa2/0xd0
      [    1.158207]  [<ffffffff8128ccf8>] pstore_mount+0x18/0x20
      [    1.158207]  [<ffffffff811ae8b9>] mount_fs+0x39/0x1b0
      [    1.158207]  [<ffffffff81160550>] ? __alloc_percpu+0x10/0x20
      [    1.158207]  [<ffffffff811c9493>] vfs_kern_mount+0x63/0xf0
      [    1.158207]  [<ffffffff811cbb0e>] do_mount+0x23e/0xa20
      [    1.158207]  [<ffffffff8115b51b>] ? strndup_user+0x4b/0xf0
      [    1.158207]  [<ffffffff811cc373>] SyS_mount+0x83/0xc0
      [    1.158207]  [<ffffffff81673cc2>] system_call_fastpath+0x16/0x1b
      [    1.158207] ---[ end trace 61981bc62de9f6f4 ]---
      Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
      Tested-by: Madper Xie <cxie@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
  9. 28 Nov 2013, 1 commit
  10. 26 Nov 2013, 1 commit
    • ARM: tegra: Provide dummy powergate implementation · 9886e1fd
      Committed by Thierry Reding
      In order to support increased build test coverage for drivers, implement
      dummy stubs for the powergate API. This will allow the drivers to be
      built without requiring support for Tegra to be selected.
      
      This patch solves the following build errors, which can be triggered in
      v3.13-rc1 by selecting DRM_TEGRA without ARCH_TEGRA:
      
      drivers/built-in.o: In function `gr3d_remove':
      drivers/gpu/drm/tegra/gr3d.c:321: undefined reference to `tegra_powergate_power_off'
      drivers/gpu/drm/tegra/gr3d.c:325: undefined reference to `tegra_powergate_power_off'
      drivers/built-in.o: In function `gr3d_probe':
      drivers/gpu/drm/tegra/gr3d.c:266: undefined reference to `tegra_powergate_sequence_power_up'
      drivers/gpu/drm/tegra/gr3d.c:273: undefined reference to `tegra_powergate_sequence_power_up'
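
      A hedged sketch of such dummy stubs, assuming they sit behind
      CONFIG_ARCH_TEGRA in the tegra-powergate header and return -ENOSYS:

          struct clk;

          #ifdef CONFIG_ARCH_TEGRA
          int tegra_powergate_power_off(int id);
          int tegra_powergate_sequence_power_up(int id, struct clk *clk);
          #else
          static inline int tegra_powergate_power_off(int id)
          {
                  return -ENOSYS;
          }

          static inline int tegra_powergate_sequence_power_up(int id,
                                                              struct clk *clk)
          {
                  return -ENOSYS;
          }
          #endif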
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      [swarren, updated commit description]
      Signed-off-by: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: Olof Johansson <olof@lixom.net>
  11. 25 Nov 2013, 2 commits
  12. 22 Nov 2013, 3 commits
    • mm: place page->pmd_huge_pte to right union · 7aa555bf
      Committed by Kirill A. Shutemov
      I don't know what went wrong, a mis-merge or something, but
      ->pmd_huge_pte was placed in the wrong union within struct page.
      
      In the original patch[1] it is placed in the union with ->lru and
      ->slab, but in commit e009bb30 ("mm: implement split page table lock
      for PMD level") it ended up in the union with ->index and ->freelist.
      
      That union also seems unused for pages holding page tables, and safe
      to re-use, but it's not what I've tested.
      
      Let's move it back to its original place.  It fixes the indentation,
      at least.  :)
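
      A simplified sketch of the intended placement; the real union in
      include/linux/mm_types.h has more members and config guards, so treat
      this as illustrative only:

          struct page {
                  /* ... */
                  union {
                          struct list_head lru;   /* pageout list / slab lists */
          #if USE_SPLIT_PMD_PTLOCKS
                          pgtable_t pmd_huge_pte; /* protected by page->ptl */
          #endif
                  };
                  /* ... */
          };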
      
      [1] https://lkml.org/lkml/2013/10/7/288
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Committed by Andrea Arcangeli
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause a dangling-pointer dereference if
      split_huge_page runs during PageHuge() while there are updates to the
      tail_page->private field.
      
      It also repeats compound_head twice for hugetlbfs, and runs
      compound_head+compound_trans_head for THP, when a single call is
      needed in both cases.
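
      A hedged sketch of a PageHuge() test along the lines of the fix, keyed
      on the compound destructor instead of tail_page->private; the actual
      code in mm/hugetlb.c may differ in detail:

          int PageHuge(struct page *page)
          {
                  if (!PageCompound(page))
                          return 0;

                  page = compound_head(page);
                  /* hugetlbfs installs free_huge_page as the compound dtor */
                  return get_compound_page_dtor(page) == free_huge_page;
          }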
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Committed by Dave Hansen
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns null and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
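
      A hedged sketch of the reworked copier in mm/migrate.c; the
      gigantic-page helper is an assumed name:

          static void copy_huge_page(struct page *dst, struct page *src)
          {
                  int i;
                  int nr_pages;

                  if (PageHuge(src)) {
                          /* hugetlbfs page: size comes from the hstate */
                          struct hstate *h = page_hstate(src);

                          nr_pages = pages_per_huge_page(h);
                          if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
                                  copy_gigantic_page(dst, src, nr_pages);
                                  return;
                          }
                  } else {
                          /* thp page: no hstate to consult */
                          VM_BUG_ON(!PageTransHuge(src));
                          nr_pages = hpage_nr_pages(src);
                  }

                  for (i = 0; i < nr_pages; i++) {
                          cond_resched();
                          copy_highpage(dst + i, src + i);
                  }
          }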
      
      I believe the bug was introduced in commit b32967ff ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case")
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 21 Nov 2013, 3 commits
    • net/phy: Add the autocross feature for forced links on VSC82x4 · 3fb69bca
      Committed by Madalin Bucur
      Add auto-MDI/MDI-X capability for forced (autonegotiation disabled)
      10/100 Mbps speeds on Vitesse VSC82x4 PHYs. Export the previously
      static function genphy_setup_forced(), which is required by the new
      config_aneg handler in the Vitesse PHY module.
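
      A hedged sketch of what the new handler can look like; the function
      and helper names are illustrative rather than the exact ones in
      drivers/net/phy/vitesse.c:

          static int vsc82x4_config_aneg(struct phy_device *phydev)
          {
                  int err;

                  /* auto-MDI/MDI-X is programmed for forced 10/100 links */
                  if (phydev->autoneg != AUTONEG_ENABLE &&
                      phydev->speed <= SPEED_100) {
                          err = vsc82x4_config_autocross_enable(phydev);
                          if (err < 0)
                                  return err;

                          return genphy_setup_forced(phydev);
                  }

                  return genphy_config_aneg(phydev);
          }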
      Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com>
      Signed-off-by: Shruti Kanetkar <Shruti@freescale.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rework recvmsg handler msg_name and msg_namelen logic · f3d33426
      Committed by Hannes Frederic Sowa
      This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
      set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
      to return msg_name to the user.
      
      This prevents numerous uninitialized memory leaks we had in the
      recvmsg handlers and makes it harder for new code to accidentally leak
      uninitialized memory.
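
      A hedged sketch of the contract this imposes on a handler; the
      protocol name and the use of sockaddr_in are illustrative:

          static int demo_recvmsg(struct kiocb *iocb, struct socket *sock,
                                  struct msghdr *msg, size_t len, int flags)
          {
                  struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
                  int err = 0;

                  /* ... receive the payload ... */

                  if (sin) {
                          /* initialize every byte of the address ... */
                          sin->sin_family = AF_INET;
                          sin->sin_port = 0;
                          sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
                          memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
                          /* ... then report its exact size */
                          msg->msg_namelen = sizeof(*sin);
                  }
                  /* msg_namelen arrives as 0; leave it 0 if no name is set */

                  return err;
          }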
      
      Optimize for the case where recvfrom is called with a NULL address. We
      don't need to copy the address at all, so set it to NULL before
      invoking the recvmsg handler. We can do so because all the recvmsg
      handlers must already cope with a plain read() being called on them,
      and read() also sets msg_name to NULL.
      
      Also document these changes in include/linux/net.h as suggested by David
      Miller.
      
      Changes since RFC:
      
      Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      affect sendto as it would bail out earlier while trying to copy-in the
      address. It also more naturally reflects the logic by the callers of
      verify_iovec.
      
      With this change in place I could remove:

          if (!uaddr || msg_sys->msg_namelen == 0)
                  msg->msg_name = NULL;
      
      This change does not alter the user visible error logic as we ignore
      msg_namelen as long as msg_name is NULL.
      
      Also remove two unnecessary curly brackets in ___sys_recvmsg and change
      comments to netdev style.
      
      Cc: David Miller <davem@davemloft.net>
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "mm: create a separate slab for page->ptl allocation" · 8b2e9b71
      Committed by Linus Torvalds
      This reverts commit ea1e7ed3.
      
      Al points out that while the commit *does* actually create a separate
      slab for the page->ptl allocation, that slab is never actually used, and
      the code continues to use kmalloc/kfree.
      
      Damien Wyart points out that the original patch did have the conversion
      to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.
      
      Revert the half-arsed attempt that didn't do anything.  If we really do
      want the special slab (remember: this is all relevant just for debug
      builds, so it's not necessarily all that critical) we might as well redo
      the patch fully.
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 20 Nov 2013, 1 commit
    • genetlink: make multicast groups const, prevent abuse · 2a94fe48
      Committed by Johannes Berg
      Register generic netlink multicast groups as an array with
      the family and give them contiguous group IDs. Then instead
      of passing the global group ID to the various functions that
      send messages, pass the ID relative to the family - for most families
      that's just 0, because they only have one group.
      
      This avoids the list_head and ID in each group, adding a new
      field for the mcast group ID offset to the family.
      
      At the same time, this allows us to prevent abusing groups
      again like the quota and dropmon code did, since we can now
      check that a family only uses a group it owns.
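
      A hedged sketch of the resulting registration pattern; the demo
      family, group name, and empty ops array are illustrative:

          static const struct genl_multicast_group demo_mcgrps[] = {
                  { .name = "events" },
          };

          static const struct genl_ops demo_ops[] = {
                  /* command handlers would go here */
          };

          static struct genl_family demo_family = {
                  .id      = GENL_ID_GENERATE,
                  .name    = "demo",
                  .version = 1,
          };

          static int __init demo_init(void)
          {
                  /* register ops and the const group array in one call */
                  return genl_register_family_with_ops_groups(&demo_family,
                                                              demo_ops,
                                                              demo_mcgrps);
          }

          /* group IDs are now family-relative: 0 means demo_mcgrps[0] */
          /* genlmsg_multicast(&demo_family, skb, portid, 0, GFP_KERNEL); */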
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>