1. 26 May 2012 (1 commit)
    • arch/tile: allow building Linux with transparent huge pages enabled · 73636b1a
      Authored by Chris Metcalf
      The change adds some infrastructure for managing tile pmds more generally,
      using pte_pmd() and pmd_pte() methods to translate pmd values to and
      from ptes, since on TILEPro a pmd is really just a nested structure
      holding a pgd (aka pte).  Several existing pmd methods are moved into
      this framework, and a whole raft of additional pmd accessors are defined
      that are used by the transparent hugepage framework.
      
      The tile PTE now has a "client2" bit.  The bit is used to indicate that
      a transparent huge page is in the process of being split into subpages.
      
      This change also fixes a generic bug where the return value of the
      generic pmdp_splitting_flush() was incorrect.
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
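
      To illustrate the pte_pmd()/pmd_pte() idiom described above, here is a
      minimal sketch assuming the TILEPro layout where a pmd is a one-member
      structure wrapping a pgd, which in turn wraps a pte; the typedefs are
      illustrative stand-ins, not the kernel's real HV_PTE-based types:

      	typedef unsigned long pte_t;            /* stand-in: a pte as one word */
      	typedef struct { pte_t pte; } pgd_t;    /* on TILEPro a pgd is just a pte */
      	typedef struct { pgd_t pgd; } pmd_t;    /* and a pmd wraps a pgd */

      	static inline pmd_t pte_pmd(pte_t pte)
      	{
      		pmd_t pmd = { { pte } };        /* rewrap the pte as a pmd */
      		return pmd;
      	}

      	static inline pte_t pmd_pte(pmd_t pmd)
      	{
      		return pmd.pgd.pte;             /* unwrap back to the pte */
      	}

      With these two helpers, a pmd accessor needed by THP can be built from
      the existing pte accessor, e.g. pmd_mkdirty(pmd) as
      pte_pmd(pte_mkdirty(pmd_pte(pmd))).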
  2. 01 May 2012 (1 commit)
  3. 24 April 2012 (1 commit)
  4. 03 April 2012 (1 commit)
    • asm-generic: add linux/types.h to cmpxchg.h · 80da6a4f
      Authored by Paul Gortmaker
      Builds of the openrisc or1ksim_defconfig show the following:
      
        In file included from arch/openrisc/include/generated/asm/cmpxchg.h:1:0,
                         from include/asm-generic/atomic.h:18,
                         from arch/openrisc/include/generated/asm/atomic.h:1,
                         from include/linux/atomic.h:4,
                         from include/linux/dcache.h:4,
                         from fs/notify/fsnotify.c:19:
        include/asm-generic/cmpxchg.h: In function '__xchg':
        include/asm-generic/cmpxchg.h:34:20: error: expected ')' before 'u8'
        include/asm-generic/cmpxchg.h:34:20: warning: type defaults to 'int' in type name
      
      and many more lines of similar errors.  It seems specific to the or32
      because most other platforms have an arch-specific component that would
      have already included types.h ahead of time, but the or32 does not.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jonas Bonn <jonas@southpole.se>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 29 March 2012 (8 commits)
  6. 28 March 2012 (1 commit)
    • compat: use sys_sendfile64() implementation for sendfile syscall · 1631fcea
      Authored by Chris Metcalf
      <asm-generic/unistd.h> was set up to use sys_sendfile() for the 32-bit
      compat API instead of sys_sendfile64(), but in fact the right thing to
      do is to use sys_sendfile64() in all cases.  The 32-bit sendfile64() API
      in glibc uses the sendfile64 syscall, so it has to be capable of doing
      full 64-bit operations.  But the sys_sendfile() kernel implementation
      has a MAX_NON_LFS test in it which explicitly limits the offset to 2^32.
      So, we need to use the sys_sendfile64() implementation in the kernel
      for this case.
      
      Cc: <stable@kernel.org>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
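
      A condensed sketch of the clamp at issue, with the constant as defined
      in <linux/fs.h> and the check reduced to its essence; the helper name
      is illustrative, not the kernel's:

      	#include <errno.h>

      	#define MAX_NON_LFS	((1UL << 31) - 1)	/* the ~2GB limit */

      	/* sys_sendfile() applies this check; sys_sendfile64() does not,
      	 * which is why the 32-bit compat entry must map to the latter. */
      	static long check_non_lfs_range(unsigned long pos, unsigned long count)
      	{
      		if (pos + count > MAX_NON_LFS)
      			return -EOVERFLOW;
      		return 0;
      	}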
  7. 26 March 2012 (1 commit)
    • params: <level>_initcall-like kernel parameters · 026cee00
      Authored by Pawel Moll
      This patch adds a set of macros that can be used to declare
      kernel parameters to be parsed _before_ initcalls at a chosen
      level are executed.  We rename the now-unused "flags" field of
      struct kernel_param to "level".  It's signed, so that the scheme can
      later cover early params as well.
      
      The linker macro collating init calls had to be modified in order
      to add additional symbols between levels; the init code later uses
      them to split the calls into blocks.
      Signed-off-by: Pawel Moll <pawel.moll@arm.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
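
      A hypothetical sketch of the ordering this enables; the struct layout
      follows the description above, but the helper names are illustrative,
      not the kernel's:

      	struct kernel_param_sketch {
      		const char *name;
      		short level;	/* the renamed "flags" field; signed so the
      				 * scheme can later cover early params too */
      		/* ... ops, arg, perm ... */
      	};

      	void parse_params_at_level(int level);	/* hypothetical helpers */
      	void run_initcalls_at_level(int level);

      	static void do_one_initcall_level(int level)
      	{
      		parse_params_at_level(level);	/* parse before this level */
      		run_initcalls_at_level(level);	/* blocks delimited by the
      						 * symbols the linker macro
      						 * now emits between levels */
      	}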
  8. 24 March 2012 (2 commits)
    • coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP · accb61fe
      Authored by Jason Baron
      Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
      for a 'VM_NODUMP' flag.  The idea is to add a new madvise() flag:
      MADV_DONTDUMP, which can be set by applications to specifically request
      memory regions which should not dump core.
      
      The specific application I have in mind is qemu: we can add a flag there
      that wouldn't dump all of guest memory when qemu dumps core.  This flag
      might also be useful for security-sensitive apps that want to make
      absolutely sure that certain parts of memory are not dumped.  To clear
      the flag, use MADV_DODUMP.
      
      [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
      [akpm@linux-foundation.org: fix up the architectures which broke]
      Signed-off-by: Jason Baron <jbaron@redhat.com>
      Acked-by: Roland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
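
      A minimal userspace usage sketch of the final flag names, assuming a
      kernel and headers that provide MADV_DONTDUMP/MADV_DODUMP:

      	#include <stddef.h>
      	#include <sys/mman.h>

      	int main(void)
      	{
      		size_t len = 16UL << 20;	/* e.g. a guest-memory region */
      		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
      			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      		if (p == MAP_FAILED)
      			return 1;
      		madvise(p, len, MADV_DONTDUMP);	/* exclude from core dumps */
      		/* ... region will be skipped if we dump core here ... */
      		madvise(p, len, MADV_DODUMP);	/* include it again */
      		return 0;
      	}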
    • consolidate WARN_...ONCE() static variables · 7ccaba53
      Authored by Jan Beulich
      Due to the alignment of the following variables, these typically consume
      more than just the single byte that a 'bool' requires, and as there are
      a few hundred instances, the cache pollution (not so much the waste of
      memory) adds up.  Put these variables into their own section, outside of
      any halfway frequently used memory range.
      
      Do the same also to the __warned variable of rcu_lockdep_assert().
      (Don't, however, include the ones used by printk_once() and the like, as
      they can potentially be hot.)
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
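
      A condensed sketch of the pattern (not the verbatim kernel macro): the
      only change of substance is that the once-flag now lives in a dedicated,
      rarely touched .data.unlikely section instead of among hot data:

      	#define WARN_ONCE(condition, format...)	({			\
      		static bool __section(.data.unlikely) __warned;		\
      		int __ret_warn_once = !!(condition);			\
      									\
      		if (unlikely(__ret_warn_once) && !__warned) {		\
      			__warned = true;				\
      			WARN(1, format);				\
      		}							\
      		unlikely(__ret_warn_once);				\
      	})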
  9. 22 March 2012 (1 commit)
    • mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Authored by Andrea Arcangeli
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem held in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad(), and that can trigger a
      false positive from pmd_bad(), which does not expect to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem, khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables from going away from under them; during code review
      it seems vm86 mode on 32-bit kernels requires that too, unless it's
      restricted to 1 thread per process or UP builds).  The race is only with
      the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd became a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd into a local variable
      on the stack (regardless of what we read, no actual CPU barriers are
      needed, only a compiler barrier), and be sure it is not changing under
      the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude that different compiler versions may have prevented the
      race too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: Larry Woodman <lwoodman@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
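
      A condensed sketch of the resulting helper (details hedged; see the
      actual commit for the exact code in <asm-generic/pgtable.h>):

      	static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
      	{
      		pmd_t pmdval = *pmd;	/* snapshot once, onto the stack */
      		barrier();		/* compiler barrier only; no CPU
      					 * barrier is needed, see above */
      		if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
      			return 1;	/* treat a hugepmd like "none" */
      		if (unlikely(pmd_bad(pmdval))) {
      			pmd_clear_bad(pmd);
      			return 1;
      		}
      		return 0;
      	}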
  10. 05 March 2012 (1 commit)
    • BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Authored by Paul Gortmaker
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
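
      For illustration, a hypothetical header following this rule (the names
      are made up):

      	/* example.h -- uses BUG_ON in a static inline, so it includes
      	 * <linux/bug.h> itself instead of relying on an indirect include. */
      	#ifndef _EXAMPLE_H
      	#define _EXAMPLE_H

      	#include <linux/bug.h>

      	static inline void example_check_index(unsigned int i, unsigned int max)
      	{
      		BUG_ON(i >= max);
      	}

      	#endif /* _EXAMPLE_H */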
  11. 03 March 2012 (1 commit)
  12. 27 February 2012 (1 commit)
    • [PARISC] fix compile break caused by "iomap: make IOPORT/PCI mapping functions conditional" · 97a29d59
      Authored by James Bottomley
      The problem in
      
      commit fea80311
      Author: Randy Dunlap <rdunlap@xenotime.net>
      Date:   Sun Jul 24 11:39:14 2011 -0700
      
          iomap: make IOPORT/PCI mapping functions conditional
      
      is that if your architecture supplies pci_iomap/pci_iounmap, it expects
      always to supply them.  Adding empty-body definitions in the !CONFIG_PCI
      case, which is what this patch does, breaks the parisc compile because
      the functions become doubly defined.  It took us a while to spot this,
      because we don't actually build !CONFIG_PCI very often (only if someone
      is brave enough to test the snake/asp machines).
      
      Since the note in the commit log says this is to fix a
      CONFIG_GENERIC_IOMAP issue (which it does because CONFIG_GENERIC_IOMAP
      supplies pci_iounmap only if CONFIG_PCI is set), there should actually
      have been a condition upon this.  This should make sure no other
      architecture's !CONFIG_PCI compile breaks in the same way as parisc.
      
      The fix had to be updated to take account of the GENERIC_PCI_IOMAP
      separation.
      Reported-by: Rolf Eike Beer <eike@sf-mail.de>
      Signed-off-by: James Bottomley <JBottomley@Parallels.com>
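
      A sketch of the conditional structure described above (hedged; the
      exact guards in the tree may differ): the empty stub is provided only
      when the generic iomap code is in use, so architectures like parisc
      that always supply their own pci_iounmap() are left alone:

      	#if defined(CONFIG_GENERIC_IOMAP) && !defined(CONFIG_PCI)
      	static inline void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
      	{
      		/* nothing to unmap without PCI */
      	}
      	#endif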
  13. 25 February 2012 (2 commits)
    • epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() · d80e731e
      Authored by Oleg Nesterov
      This patch is intentionally incomplete to simplify the review.
      It ignores ep_unregister_pollwait() which plays with the same wqh.
      See the next change.
      
      epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
      f_op->poll() needs. In particular it assumes that the wait queue
      can't go away until eventpoll_release(). This is not true in the case
      of signalfd: the task which does EPOLL_CTL_ADD uses its ->sighand,
      which is not connected to the file.
      
      This patch adds the special event, POLLFREE, currently only for
      epoll. It expects that init_poll_funcptr()'ed hook should do the
      necessary cleanup. Perhaps it should be defined as EPOLLFREE in
      eventpoll.
      
      __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
      ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
      helper.
      
      ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
      This makes the poll entry inconsistent, but we don't care. If you
      share epoll fd which contains our sigfd with another process you
      should blame yourself. signalfd is "really special". I simply do
      not know how we can define the "right" semantics if it used with
      epoll.
      
      The main problem is, epoll calls signalfd_poll() once to establish
      the connection with the wait queue, after that signalfd_poll(NULL)
      returns different/inconsistent results depending on who does
      EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
      has nothing to do with the file, it works with the current thread.
      
      In short: this patch is the hack which tries to fix the symptoms.
      It also assumes that nobody can take tasklist_lock under epoll
      locks, which seems to be true.
      
      Note:
      
      	- we do not have wake_up_all_poll() but wake_up_poll()
      	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.
      
      	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
      	  we need a couple of simple changes in eventpoll.c to
      	  make sure it can't be "lost".
      Reported-by: Maxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
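
      A condensed sketch of the two halves described above (signalfd flushing
      its waiters, epoll unhooking on POLLFREE); locking and error handling
      are elided:

      	/* signalfd side: called from __cleanup_sighand() before kfree() */
      	void signalfd_cleanup(struct sighand_struct *sighand)
      	{
      		wait_queue_head_t *wqh = &sighand->signalfd_wqh;

      		if (waitqueue_active(wqh))
      			wake_up_poll(wqh, POLLHUP | POLLFREE);
      	}

      	/* epoll side, inside ep_poll_callback(): on POLLFREE, detach from
      	 * the dying wait queue head so it can be freed safely. */
      	if ((unsigned long)key & POLLFREE)
      		list_del_init(&wait->task_list);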
    • bitops: Add missing parentheses to new get_order macro · b893485d
      Authored by Joerg Roedel
      The new get_order macro introduced in commit
      
      	d66acc39
      
      does not use parentheses around all uses of the parameter n.
      This causes new compile warnings, for example in
      drivers/iommu/amd_iommu_init.c:
      
      drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]
      drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses]
      
      Fix those warnings by adding the missing parentheses.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Cc: David Howells <dhowells@redhat.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
      Link: http://lkml.kernel.org/r/1330088295-28732-1-git-send-email-joerg.roedel@amd.com
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
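
      A small userspace illustration of the hazard class being fixed here:

      	#define BAD_ORDER(n)	(n == 0UL ? 64 : 0)	/* n unparenthesized */
      	#define GOOD_ORDER(n)	((n) == 0UL ? 64 : 0)	/* fixed */

      	/* BAD_ORDER(x & 3) expands to (x & 3 == 0UL ? 64 : 0).  Since '=='
      	 * binds tighter than '&', that groups as ((x & (3 == 0UL)) ? ...),
      	 * i.e. x & 0 -- and gcc warns "suggest parentheses around comparison
      	 * in operand of '&'".  GOOD_ORDER(x & 3) expands to
      	 * (((x & 3) == 0UL) ? 64 : 0), as intended. */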
  14. 24 February 2012 (4 commits)
  15. 22 February 2012 (2 commits)
    • asm-generic: architecture independent readq/writeq for 32bit environment · 797a796a
      Authored by Hitoshi Mitake
      This provides unified readq()/writeq() helper functions for 32-bit
      drivers.
      
      For some cases, readq/writeq without atomicity is harmful, and the
      order of I/O accesses has to be specified explicitly.  So this patch
      adds two new header files containing non-atomic readq/writeq:
      
       - <asm-generic/io-64-nonatomic-lo-hi.h> provides non-atomic readq/
         writeq with the order of lower address -> higher address
      
       - <asm-generic/io-64-nonatomic-hi-lo.h> provides non-atomic readq/
         writeq with reversed order
      
      This allows us to remove some readq()s that were added to drivers when
      default non-atomic ones were removed in commit dbee8a0a ("x86:
      remove 32-bit versions of readq()/writeq()")
      
      The drivers which need readq/writeq but can do with the non-atomic ones
      must add the line:
      
        #include <asm-generic/io-64-nonatomic-lo-hi.h> /* or hi-lo.h */
      
      But this will be a nop in 64-bit environments, and no other #ifdefs are
      required.  So I believe this patch can solve the problems of
       1. driver-specific readq/writeq
       2. atomicity and order of I/O access
      
      This patch was tested by building allyesconfig and allmodconfig with
      ARCH=x86 and ARCH=i386 on top of tip/master.
      
      Cc: Kashyap Desai <Kashyap.Desai@lsi.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Ravi Anand <ravi.anand@qlogic.com>
      Cc: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
      Cc: Matthew Garrett <mjg@redhat.com>
      Cc: Jason Uhlenkott <juhlenko@akamai.com>
      Cc: James Bottomley <James.Bottomley@parallels.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: James Bottomley <jbottomley@parallels.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
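
      A condensed sketch of the lo-hi flavour, mirroring the new header: two
      32-bit accesses, lower address first, defined only when the arch lacks
      native 64-bit MMIO accessors:

      	#ifndef readq
      	static inline __u64 readq(const volatile void __iomem *addr)
      	{
      		const volatile u32 __iomem *p = addr;
      		u32 low, high;

      		low = readl(p);
      		high = readl(p + 1);

      		return low + ((u64)high << 32);
      	}
      	#endif

      	#ifndef writeq
      	static inline void writeq(__u64 val, volatile void __iomem *addr)
      	{
      		writel(val, addr);		/* low word first ... */
      		writel(val >> 32, addr + 4);	/* ... then high word */
      	}
      	#endif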
    • sock: Introduce the SO_PEEK_OFF sock option · ef64a54f
      Authored by Pavel Emelyanov
      This one specifies where to start MSG_PEEK-ing queue data from. When
      set to a negative value, MSG_PEEK works as usual -- it always peeks
      from the head of the queue.
      
      When some bytes are peeked from the queue and the peeking offset is
      non-negative, it is moved forward so that the next peek will return
      the next portion of data.
      
      When a non-peeking recvmsg occurs and the peeking offset is
      non-negative, it is moved backward so that the next peek will still
      peek the proper data (i.e. the data that would have been picked if
      there were no non-peeking recv in between).
      
      The offset is set using a per-protocol operation to let the protocol
      handle the locking issues and to check whether the peeking offset
      feature is supported by the protocol the socket belongs to.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
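
      A minimal userspace usage sketch, assuming 'fd' is a connected socket
      whose protocol implements the peek-off operation:

      	#include <sys/socket.h>

      	static void peek_twice(int fd)
      	{
      		int off = 0;
      		char a[4], b[4];

      		setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
      		recv(fd, a, sizeof(a), MSG_PEEK);   /* peeks bytes 0..3 */
      		recv(fd, b, sizeof(b), MSG_PEEK);   /* peeks 4..7, not 0..3 again */
      	}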
  16. 21 February 2012 (3 commits)
    • bitops: Optimise get_order() · d66acc39
      Authored by David Howells
      Optimise get_order() to use bit scanning instructions if such exist rather than
      a loop.  Also, make it possible to use get_order() in static initialisations
      too by building it on top of ilog2() in the constant parameter case.
      
      This has been tested for i386 and x86_64 using the following userspace program,
      and for FRV by making appropriate substitutions for fls() and fls64().  It will
      abort if the case for get_order() deviates from the original except for the
      order of 0, for which get_order() produces an undefined result.  This program
      tests both dynamic and static parameters.
      
      	#include <stdlib.h>
      	#include <stdio.h>
      
      	#ifdef __x86_64__
      	#define BITS_PER_LONG 64
      	#else
      	#define BITS_PER_LONG 32
      	#endif
      
      	#define PAGE_SHIFT 12
      
      	typedef unsigned long long __u64, u64;
      	typedef unsigned int __u32, u32;
      	#define noinline	__attribute__((noinline))
      
      	static inline int fls(int x)
      	{
      		int bitpos = -1;
      
      		asm("bsrl %1,%0"
      		    : "+r" (bitpos)
      		    : "rm" (x));
      		return bitpos + 1;
      	}
      
      	static __always_inline int fls64(__u64 x)
      	{
      	#if BITS_PER_LONG == 64
      		long bitpos = -1;
      
      		asm("bsrq %1,%0"
      		    : "+r" (bitpos)
      		    : "rm" (x));
      		return bitpos + 1;
      	#else
      		__u32 h = x >> 32, l = x;
      		int bitpos = -1;
      
      		asm("bsrl	%1,%0	\n"
      		    "subl	%2,%0	\n"
      		    "bsrl	%3,%0	\n"
      		    : "+r" (bitpos)
      		    : "rm" (l), "i"(32), "rm" (h));
      
      		return bitpos + 33;
      	#endif
      	}
      
      	static inline __attribute__((const))
      	int __ilog2_u32(u32 n)
      	{
      		return fls(n) - 1;
      	}
      
      	static inline __attribute__((const))
      	int __ilog2_u64(u64 n)
      	{
      		return fls64(n) - 1;
      	}
      
      	extern __attribute__((const, noreturn))
      	int ____ilog2_NaN(void);
      
      	#define ilog2(n)				\
      	(						\
      		__builtin_constant_p(n) ? (		\
      			(n) < 1 ? ____ilog2_NaN() :	\
      			(n) & (1ULL << 63) ? 63 :	\
      			(n) & (1ULL << 62) ? 62 :	\
      			(n) & (1ULL << 61) ? 61 :	\
      			(n) & (1ULL << 60) ? 60 :	\
      			(n) & (1ULL << 59) ? 59 :	\
      			(n) & (1ULL << 58) ? 58 :	\
      			(n) & (1ULL << 57) ? 57 :	\
      			(n) & (1ULL << 56) ? 56 :	\
      			(n) & (1ULL << 55) ? 55 :	\
      			(n) & (1ULL << 54) ? 54 :	\
      			(n) & (1ULL << 53) ? 53 :	\
      			(n) & (1ULL << 52) ? 52 :	\
      			(n) & (1ULL << 51) ? 51 :	\
      			(n) & (1ULL << 50) ? 50 :	\
      			(n) & (1ULL << 49) ? 49 :	\
      			(n) & (1ULL << 48) ? 48 :	\
      			(n) & (1ULL << 47) ? 47 :	\
      			(n) & (1ULL << 46) ? 46 :	\
      			(n) & (1ULL << 45) ? 45 :	\
      			(n) & (1ULL << 44) ? 44 :	\
      			(n) & (1ULL << 43) ? 43 :	\
      			(n) & (1ULL << 42) ? 42 :	\
      			(n) & (1ULL << 41) ? 41 :	\
      			(n) & (1ULL << 40) ? 40 :	\
      			(n) & (1ULL << 39) ? 39 :	\
      			(n) & (1ULL << 38) ? 38 :	\
      			(n) & (1ULL << 37) ? 37 :	\
      			(n) & (1ULL << 36) ? 36 :	\
      			(n) & (1ULL << 35) ? 35 :	\
      			(n) & (1ULL << 34) ? 34 :	\
      			(n) & (1ULL << 33) ? 33 :	\
      			(n) & (1ULL << 32) ? 32 :	\
      			(n) & (1ULL << 31) ? 31 :	\
      			(n) & (1ULL << 30) ? 30 :	\
      			(n) & (1ULL << 29) ? 29 :	\
      			(n) & (1ULL << 28) ? 28 :	\
      			(n) & (1ULL << 27) ? 27 :	\
      			(n) & (1ULL << 26) ? 26 :	\
      			(n) & (1ULL << 25) ? 25 :	\
      			(n) & (1ULL << 24) ? 24 :	\
      			(n) & (1ULL << 23) ? 23 :	\
      			(n) & (1ULL << 22) ? 22 :	\
      			(n) & (1ULL << 21) ? 21 :	\
      			(n) & (1ULL << 20) ? 20 :	\
      			(n) & (1ULL << 19) ? 19 :	\
      			(n) & (1ULL << 18) ? 18 :	\
      			(n) & (1ULL << 17) ? 17 :	\
      			(n) & (1ULL << 16) ? 16 :	\
      			(n) & (1ULL << 15) ? 15 :	\
      			(n) & (1ULL << 14) ? 14 :	\
      			(n) & (1ULL << 13) ? 13 :	\
      			(n) & (1ULL << 12) ? 12 :	\
      			(n) & (1ULL << 11) ? 11 :	\
      			(n) & (1ULL << 10) ? 10 :	\
      			(n) & (1ULL <<  9) ?  9 :	\
      			(n) & (1ULL <<  8) ?  8 :	\
      			(n) & (1ULL <<  7) ?  7 :	\
      			(n) & (1ULL <<  6) ?  6 :	\
      			(n) & (1ULL <<  5) ?  5 :	\
      			(n) & (1ULL <<  4) ?  4 :	\
      			(n) & (1ULL <<  3) ?  3 :	\
      			(n) & (1ULL <<  2) ?  2 :	\
      			(n) & (1ULL <<  1) ?  1 :	\
      			(n) & (1ULL <<  0) ?  0 :	\
      			____ilog2_NaN()			\
      					   ) :		\
      		(sizeof(n) <= 4) ?			\
      		__ilog2_u32(n) :			\
      		__ilog2_u64(n)				\
      	 )
      
      	static noinline __attribute__((const))
      	int old_get_order(unsigned long size)
      	{
      		int order;
      
      		size = (size - 1) >> (PAGE_SHIFT - 1);
      		order = -1;
      		do {
      			size >>= 1;
      			order++;
      		} while (size);
      		return order;
      	}
      
      	static noinline __attribute__((const))
      	int __get_order(unsigned long size)
      	{
      		int order;
      		size--;
      		size >>= PAGE_SHIFT;
      	#if BITS_PER_LONG == 32
      		order = fls(size);
      	#else
      		order = fls64(size);
      	#endif
      		return order;
      	}
      
      	#define get_order(n)						\
      	(								\
      		__builtin_constant_p(n) ? (				\
      			(n == 0UL) ? BITS_PER_LONG - PAGE_SHIFT :	\
      			((n < (1UL << PAGE_SHIFT)) ? 0 :		\
      			 ilog2((n) - 1) - PAGE_SHIFT + 1)		\
      		) :							\
      		__get_order(n)						\
      	)
      
      	#define order(N) \
      		{ (1UL << N) - 1,	get_order((1UL << N) - 1)	},	\
      		{ (1UL << N),		get_order((1UL << N))		},	\
      		{ (1UL << N) + 1,	get_order((1UL << N) + 1)	}
      
      	struct order {
      		unsigned long n, order;
      	};
      
      	static const struct order order_table[] = {
      		order(0),
      		order(1),
      		order(2),
      		order(3),
      		order(4),
      		order(5),
      		order(6),
      		order(7),
      		order(8),
      		order(9),
      		order(10),
      		order(11),
      		order(12),
      		order(13),
      		order(14),
      		order(15),
      		order(16),
      		order(17),
      		order(18),
      		order(19),
      		order(20),
      		order(21),
      		order(22),
      		order(23),
      		order(24),
      		order(25),
      		order(26),
      		order(27),
      		order(28),
      		order(29),
      		order(30),
      		order(31),
      	#if BITS_PER_LONG == 64
      		order(32),
      		order(33),
      		order(34),
      		order(35),
      	#endif
      		{ 0x2929 }
      	};
      
      	void check(int loop, unsigned long n)
      	{
      		unsigned long old, new;
      
      		printf("[%2d]: %09lx | ", loop, n);
      
      		old = old_get_order(n);
      		new = get_order(n);
      
      		printf("%3ld, %3ld\n", old, new);
      		if (n != 0 && old != new)
      			abort();
      	}
      
      	int main(int argc, char **argv)
      	{
      		const struct order *p;
      		unsigned long n;
      		int loop;
      
      		for (loop = 0; loop <= BITS_PER_LONG - 1; loop++) {
      			n = 1UL << loop;
      			check(loop, n - 1);
      			check(loop, n);
      			check(loop, n + 1);
      		}
      
      		for (p = order_table; p->n != 0x2929; p++) {
      			unsigned long old, new;
      
      			old = old_get_order(p->n);
      			new = p->order;
      			printf("%09lx\t%3ld, %3ld\n", p->n, old, new);
      			if (p->n != 0 && old != new)
      				abort();
      		}
      
      		return 0;
      	}
      
      Disassembling the x86_64 version of the above code shows:
      
      	0000000000400510 <old_get_order>:
      	  400510:       48 83 ef 01             sub    $0x1,%rdi
      	  400514:       b8 ff ff ff ff          mov    $0xffffffff,%eax
      	  400519:       48 c1 ef 0b             shr    $0xb,%rdi
      	  40051d:       0f 1f 00                nopl   (%rax)
      	  400520:       83 c0 01                add    $0x1,%eax
      	  400523:       48 d1 ef                shr    %rdi
      	  400526:       75 f8                   jne    400520 <old_get_order+0x10>
      	  400528:       f3 c3                   repz retq
      	  40052a:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
      
      	0000000000400530 <__get_order>:
      	  400530:       48 83 ef 01             sub    $0x1,%rdi
      	  400534:       48 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%rax
      	  40053b:       48 c1 ef 0c             shr    $0xc,%rdi
      	  40053f:       48 0f bd c7             bsr    %rdi,%rax
      	  400543:       83 c0 01                add    $0x1,%eax
      	  400546:       c3                      retq
      	  400547:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
      	  40054e:       00 00
      
      As can be seen, the new __get_order() function is simpler than the
      old_get_order() function.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20120220223928.16199.29548.stgit@warthog.procyon.org.uk
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • bitops: Adjust the comment on get_order() to describe the size==0 case · e0891a98
      Authored by David Howells
      Adjust the comment on get_order() to note that passing a size of 0
      results in an undefined value.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Link: http://lkml.kernel.org/r/20120220223917.16199.9416.stgit@warthog.procyon.org.uk
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • posix_types: Introduce __kernel_[u]long_t · afead38d
      Authored by H. Peter Anvin
      Introduce __kernel_[u]long_t, which allows an ABI to override all
      defaults of type [unsigned] long.
      
      This enables x32 and potentially other 32-bit userspace on 64-bit
      kernel ABIs.
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
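
      The mechanism reduces to a pair of overridable typedefs; an ABI such as
      x32 supplies its own definitions (e.g. based on long long) before the
      generic defaults below are seen, and every __kernel_* type built on top
      of them follows automatically:

      	#ifndef __kernel_long_t
      	typedef long		__kernel_long_t;
      	typedef unsigned long	__kernel_ulong_t;
      	#endif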
  17. 15 February 2012 (3 commits)
  18. 01 February 2012 (1 commit)
  19. 13 January 2012 (1 commit)
    • thp: add tlb_remove_pmd_tlb_entry · f21760b1
      Authored by Shaohua Li
      We have tlb_remove_tlb_entry to indicate that a pte tlb flush entry
      should be flushed, but no corresponding API for a pmd entry.  This isn't
      a problem so far because THP is only for x86 currently and tlb_flush()
      under x86 will flush the entire TLB.  But this is confusing and could be
      missed if THP is ported to another arch.
      
      Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
      __tlb_remove_page() as suggested by Andrea Arcangeli.  The
      __tlb_remove_page() function is supposed to be called after
      tlb_remove_xxx_tlb_entry() and we can catch any misuse.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
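
      A condensed sketch of the new hook, parallel to the existing pte-level
      tlb_remove_tlb_entry(); the exact body is hedged from the description
      above:

      	#define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
      		do {							\
      			tlb->need_flush = 1;				\
      			__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
      		} while (0)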
  20. 05 January 2012 (1 commit)
  21. 04 January 2012 (1 commit)
  22. 30 December 2011 (1 commit)
    • procfs: do not confuse jiffies with cputime64_t · 34845636
      Authored by Andreas Schwab
      Commit 2a95ea6c ("procfs: do not overflow get_{idle,iowait}_time
      for nohz") did not take into account that one some architectures jiffies
      and cputime use different units.
      
      This causes get_idle_time() to return numbers in the wrong units, making
      the idle time fields in /proc/stat wrong.
      
      Instead of converting the usec value returned by
      get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function
      usecs_to_cputime64 to convert it to the correct unit of cputime64_t.
      Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Artem S. Tashkinov" <t.artem@mailcity.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
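
      A condensed sketch of the corrected conversion in fs/proc/stat.c
      (details hedged): the usec value goes straight to cputime64_t rather
      than via jiffies:

      	static cputime64_t get_idle_time(int cpu)
      	{
      		u64 idle_time_us = get_cpu_idle_time_us(cpu, NULL);

      		if (idle_time_us == -1ULL)
      			/* nohz data unavailable; this counter is already
      			 * in cputime units */
      			return kstat_cpu(cpu).cpustat.idle;

      		return usecs_to_cputime64(idle_time_us);
      	}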
  23. 15 December 2011 (1 commit)