1. 09 9月, 2015 17 次提交
  2. 08 9月, 2015 2 次提交
  3. 05 9月, 2015 21 次提交
    • O
      mm: move ->mremap() from file_operations to vm_operations_struct · 5477e70a
      Oleg Nesterov 提交于
      vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this
      way ->mremap() can have more users.  Say, vdso.
      
      While at it, s/aio_ring_remap/aio_ring_mremap/.
      
      Note: this is the minimal change before ->mremap() finds another user in
      file_operations; this method should have more arguments, and it can be
      used to kill arch_remap().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5477e70a
    • A
      userfaultfd: avoid missing wakeups during refile in userfaultfd_read · 2c5b7e1b
      Andrea Arcangeli 提交于
      During the refile in userfaultfd_read both waitqueues could look empty to
      the lockless wake_userfault().  Use a seqcount to prevent this false
      negative that could leave an userfault blocked.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c5b7e1b
    • A
      userfaultfd: allow signals to interrupt a userfault · dfa37dc3
      Andrea Arcangeli 提交于
      This is only simple to achieve if the userfault is going to return to
      userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY
      despite we temporarily released the mmap_sem.  The fault would just be
      retried by userland then.  This is safe at least on x86 and powerpc (the
      two archs with the syscall implemented so far).
      
      Hint to verify for which archs this is safe: after handle_mm_fault
      returns, no access to data structures protected by the mmap_sem must be
      done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem)
      is called.
      
      This has two main benefits: signals can run with lower latency in
      production (signals aren't blocked by userfaults and userfaults are
      immediately repeated after signal processing) and gdb can then trivially
      debug the threads blocked in this kind of userfaults coming directly from
      userland.
      
      On a side note: while gdb has a need to get signal processed, coredumps
      always worked perfectly with userfaults, no matter if the userfault is
      triggered by GUP a kernel copy_user or directly from userland.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dfa37dc3
    • A
      userfaultfd: require UFFDIO_API before other ioctls · e6485a47
      Andrea Arcangeli 提交于
      UFFDIO_API was already forced before read/poll could work.  This makes the
      code more strict to force it also for all other ioctls.
      
      All users would already have been required to call UFFDIO_API before
      invoking other ioctls but this makes it more explicit.
      
      This will ensure we can change all ioctls (all but UFFDIO_API/struct
      uffdio_api) with a bump of uffdio_api.api.
      
      There's no actual plan or need to change the API or the ioctl, the current
      API already should cover fine even the non cooperative usage, but this is
      just for the longer term future just in case.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6485a47
    • A
      userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE · ad465cae
      Andrea Arcangeli 提交于
      These two ioctl allows to either atomically copy or to map zeropages
      into the virtual address space. This is used by the thread that opened
      the userfaultfd to resolve the userfaults.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad465cae
    • A
      userfaultfd: buildsystem activation · a14c151e
      Andrea Arcangeli 提交于
      This allows to select the userfaultfd during configuration to build it.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a14c151e
    • A
      userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read · 8d2afd96
      Andrea Arcangeli 提交于
      Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
      userfaultfd_read if they are run on different threads simultaneously.
      
      Until now qemu solved the race in userland: the race was explicitly
      and intentionally left for userland to solve. However we can also
      solve it in kernel.
      
      Requiring all users to solve this race if they use two threads (one
      for the background transfer and one for the userfault reads) isn't
      very attractive from an API prospective, furthermore this allows to
      remove a whole bunch of mutex and bitmap code from qemu, making it
      faster. The cost of __get_user_pages_fast should be insignificant
      considering it scales perfectly and the pagetables are already hot in
      the CPU cache, compared to the overhead in userland to maintain those
      structures.
      
      Applying this patch is backwards compatible with respect to the
      userfaultfd userland API, however reverting this change wouldn't be
      backwards compatible anymore.
      
      Without this patch qemu in the background transfer thread, has to read
      the old state, and do UFFDIO_WAKE if old_state is missing but it
      become REQUESTED by the time it tries to set it to RECEIVED (signaling
      the other side received an userfault).
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
                              postcopy_place_page
                              read old_state -> MISSING
                              UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
                                              poll() -> POLLIN
                                              read() -> 0x7fb76a139000
                                              postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
      
                              tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                              /* check that no userfault raced with UFFDIO_COPY */
                              if (old_state == MISSING && tmp_state == REQUESTED)
                                      UFFDIO_WAKE from background thread
      
      And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
                              postcopy_place_page
                              read old_state -> MISSING
                              UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                              tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
                                              poll() -> POLLIN
                                              read() -> 0x7fb76a139000
      
                                              if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                      UFFDIO_WAKE from userfault thread
      
      This patch removes the need of both UFFDIO_WAKE and of the associated
      per-page tristate as well.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d2afd96
    • A
      userfaultfd: allocate the userfaultfd_ctx cacheline aligned · 3004ec9c
      Andrea Arcangeli 提交于
      Use proper slab to guarantee alignment.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3004ec9c
    • A
      userfaultfd: optimize read() and poll() to be O(1) · 15b726ef
      Andrea Arcangeli 提交于
      This makes read O(1) and poll that was already O(1) becomes lockless.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15b726ef
    • A
      userfaultfd: wake pending userfaults · ba85c702
      Andrea Arcangeli 提交于
      This is an optimization but it's a userland visible one and it affects
      the API.
      
      The downside of this optimization is that if you call poll() and you
      get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
      may be waken by a different thread, before read(ufd) comes
      around. This in short means that poll() isn't really usable if the
      userfaultfd is opened in blocking mode.
      
      userfaults won't wait in "pending" state to be read anymore and any
      UFFDIO_WAKE or similar operations that has the objective of waking
      userfaults after their resolution, will wake all blocked userfaults
      for the resolved range, including those that haven't been read() by
      userland yet.
      
      The behavior of poll() becomes not standard, but this obviates the
      need of "spurious" UFFDIO_WAKE and it lets the userland threads to
      restart immediately without requiring an UFFDIO_WAKE. This is even
      more significant in case of repeated faults on the same address from
      multiple threads.
      
      This optimization is justified by the measurement that the number of
      spurious UFFDIO_WAKE accounts for 5% and 10% of the total
      userfaults for heavy workloads, so it's worth optimizing those away.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba85c702
    • A
      userfaultfd: change the read API to return a uffd_msg · a9b85f94
      Andrea Arcangeli 提交于
      I had requests to return the full address (not the page aligned one) to
      userland.
      
      It's not entirely clear how the page offset could be relevant because
      userfaults aren't like SIGBUS that can sigjump to a different place and it
      actually skip resolving the fault depending on a page offset.  There's
      currently no real way to skip the fault especially because after a
      UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
      kernel without having to return to userland first (not even self modifying
      code replacing the .text that touched the faulting address would prevent
      the fault to be repeated).  Userland cannot skip repeating the fault even
      more so if the fault was triggered by a KVM secondary page fault or any
      get_user_pages or any copy-user inside some syscall which will return to
      kernel code.  The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
      to a SIGBUS being raised because the userfault can't wait if it cannot
      release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for
      that).
      
      Still returning userland a proper structure during the read() on the uffd,
      can allow to use the current UFFD_API for the future non-cooperative
      extensions too and it looks cleaner as well.  Once we get additional
      fields there's no point to return the fault address page aligned anymore
      to reuse the bits below PAGE_SHIFT.
      
      The only downside is that the read() syscall will read 32bytes instead of
      8bytes but that's not going to be measurable overhead.
      
      The total number of new events that can be extended or of new future bits
      for already shipped events, is limited to 64 by the features field of the
      uffdio_api structure.  If more will be needed a bump of UFFD_API will be
      required.
      
      [akpm@linux-foundation.org: use __packed]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9b85f94
    • P
      userfaultfd: Rename uffd_api.bits into .features · 3f602d27
      Pavel Emelyanov 提交于
      This is (seems to be) the minimal thing that is required to unblock
      standard uffd usage from the non-cooperative one.  Now more bits can be
      added to the features field indicating e.g.  UFFD_FEATURE_FORK and others
      needed for the latter use-case.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f602d27
    • A
      userfaultfd: add new syscall to provide memory externalization · 86039bd3
      Andrea Arcangeli 提交于
      Once an userfaultfd has been created and certain region of the process
      virtual address space have been registered into it, the thread responsible
      for doing the memory externalization can manage the page faults in
      userland by talking to the kernel using the userfaultfd protocol.
      
      poll() can be used to know when there are new pending userfaults to be
      read (POLLIN).
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86039bd3
    • A
      userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP · 16ba6f81
      Andrea Arcangeli 提交于
      These two flags gets set in vma->vm_flags to tell the VM common code
      if the userfaultfd is armed and in which mode (only tracking missing
      faults, only tracking wrprotect faults or both). If neither flags is
      set it means the userfaultfd is not armed on the vma.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16ba6f81
    • K
      fs: create and use seq_show_option for escaping · a068acf2
      Kees Cook 提交于
      Many file systems that implement the show_options hook fail to correctly
      escape their output which could lead to unescaped characters (e.g.  new
      lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files.  This
      could lead to confusion, spoofed entries (resulting in things like
      systemd issuing false d-bus "mount" notifications), and who knows what
      else.  This looks like it would only be the root user stepping on
      themselves, but it's possible weird things could happen in containers or
      in other situations with delegated mount privileges.
      
      Here's an example using overlay with setuid fusermount trusting the
      contents of /proc/mounts (via the /etc/mtab symlink).  Imagine the use
      of "sudo" is something more sneaky:
      
        $ BASE="ovl"
        $ MNT="$BASE/mnt"
        $ LOW="$BASE/lower"
        $ UP="$BASE/upper"
        $ WORK="$BASE/work/ 0 0
        none /proc fuse.pwn user_id=1000"
        $ mkdir -p "$LOW" "$UP" "$WORK"
        $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
        $ cat /proc/mounts
        none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
        none /proc fuse.pwn user_id=1000 0 0
        $ fusermount -u /proc
        $ cat /proc/mounts
        cat: /proc/mounts: No such file or directory
      
      This fixes the problem by adding new seq_show_option and
      seq_show_option_n helpers, and updating the vulnerable show_option
      handlers to use them as needed.  Some, like SELinux, need to be open
      coded due to unusual existing escape mechanisms.
      
      [akpm@linux-foundation.org: add lost chunk, per Kees]
      [keescook@chromium.org: seq_show_option should be using const parameters]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Acked-by: NJan Kara <jack@suse.com>
      Acked-by: NPaul Moore <paul@paul-moore.com>
      Cc: J. R. Okajima <hooanon05g@gmail.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a068acf2
    • J
      ocfs2: clean up redundant NULL checks before kfree · 46359295
      Joseph Qi 提交于
      NULL check before kfree is redundant and so clean them up.
      Signed-off-by: NJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46359295
    • J
      ocfs2: neaten do_error, ocfs2_error and ocfs2_abort · 7ecef14a
      Joe Perches 提交于
      These uses sometimes do and sometimes don't have '\n' terminations.  Make
      the uses consistently use '\n' terminations and remove the newline from
      the functions.
      
      Miscellanea:
      
      o Coalesce formats
      o Realign arguments
      Signed-off-by: NJoe Perches <joe@perches.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ecef14a
    • X
      ocfs2: do not set fs read-only if rec[0] is empty while committing truncate · d0c97d52
      Xue jiufei 提交于
      While appending an extent to a file, it will call these functions:
      ocfs2_insert_extent
      
        -> call ocfs2_grow_tree() if there's no free rec
           -> ocfs2_add_branch add a new branch to extent tree,
              now rec[0] in the leaf of rightmost path is empty
        -> ocfs2_do_insert_extent
           -> ocfs2_rotate_tree_right
             -> ocfs2_extend_rotate_transaction
                -> jbd2_journal_restart if jbd2_journal_extend fail
           -> ocfs2_insert_path
              -> ocfs2_extend_trans
                -> jbd2_journal_restart if jbd2_journal_extend fail
              -> ocfs2_insert_at_leaf
           -> ocfs2_et_update_clusters
      Function jbd2_journal_restart() may be called and it may happened that
      buffers dirtied in ocfs2_add_branch() are committed
      while buffers dirtied in ocfs2_insert_at_leaf() and
      ocfs2_et_update_clusters() are not.
      So an empty rec[0] is left in rightmost path which will cause
      read-only filesystem when call ocfs2_commit_truncate()
      with the error message: "Inode %lu has an empty extent record".
      
      This is not a serious problem, so remove the rightmost path when call
      ocfs2_commit_truncate().
      Signed-off-by: Njoyce.xue <xuejiufei@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0c97d52
    • Y
      ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ocfs2_write_end_nolock() · 7f27ec97
      yangwenfang 提交于
      1: After we call ocfs2_journal_access_di() in ocfs2_write_begin(),
         jbd2_journal_restart() may also be called, in this function transaction
         A's t_updates-- and obtains a new transaction B.  If
         jbd2_journal_commit_transaction() is happened to commit transaction A,
         when t_updates==0, it will continue to complete commit and unfile
         buffer.
      
         So when jbd2_journal_dirty_metadata(), the handle is pointed a new
         transaction B, and the buffer head's journal head is already freed,
         jh->b_transaction == NULL, jh->b_next_transaction == NULL, it returns
         EINVAL, So it triggers the BUG_ON(status).
      
      thread 1                                          jbd2
      ocfs2_write_begin                     jbd2_journal_commit_transaction
      ocfs2_write_begin_nolock
        ocfs2_start_trans
          jbd2__journal_start(t_updates+1,
                             transaction A)
          ocfs2_journal_access_di
          ocfs2_write_cluster_by_desc
            ocfs2_mark_extent_written
              ocfs2_change_extent_flag
                ocfs2_split_extent
                  ocfs2_extend_rotate_transaction
                    jbd2_journal_restart
                    (t_updates-1,transaction B) t_updates==0
                                              __jbd2_journal_refile_buffer
                                              (jh->b_transaction = NULL)
      ocfs2_write_end
      ocfs2_write_end_nolock
          ocfs2_journal_dirty
              jbd2_journal_dirty_metadata(bug)
         ocfs2_commit_trans
      
      2.  In ext4, I found that: jbd2_journal_get_write_access() called by
         ext4_write_end.
      
      ext4_write_begin
          ext4_journal_start
              __ext4_journal_start_sb
                  ext4_journal_check_start
                  jbd2__journal_start
      
      ext4_write_end
          ext4_mark_inode_dirty
              ext4_reserve_inode_write
                  ext4_journal_get_write_access
                      jbd2_journal_get_write_access
              ext4_mark_iloc_dirty
                  ext4_do_update_inode
                      ext4_handle_dirty_metadata
                          jbd2_journal_dirty_metadata
      
      3. So I think we should put ocfs2_journal_access_di before
         ocfs2_journal_dirty in the ocfs2_write_end.  and it works well after my
         modification.
      Signed-off-by: Nvicky <vicky.yangwenfang@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Zhangguanghui <zhang.guanghui@h3c.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f27ec97
    • T
      ocfs2: use 64bit variables to track heartbeat time · 40476b82
      Tina Ruchandani 提交于
      o2hb_elapsed_msecs computes the time taken for a disk heartbeat.
      'struct timeval' variables are used to store start and end times.  On
      32-bit systems, the 'tv_sec' component of 'struct timeval' will overflow
      in year 2038 and beyond.
      
      This patch solves the overflow with the following:
      
      1. Replace o2hb_elapsed_msecs using 'ktime_t' values to measure start
         and end time, and built-in function 'ktime_ms_delta' to compute the
         elapsed time.  ktime_get_real() is used since the code prints out the
         wallclock time.
      
      2. Changes format string to print time as a single 64-bit nanoseconds
         value ("%lld") instead of seconds and microseconds.  This simplifies
         the code since converting ktime_t to that format would need expensive
         computation.  However, the debug log string is less readable than the
         previous format.
      Signed-off-by: NTina Ruchandani <ruchandani.tina@gmail.com>
      Suggested by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40476b82
    • J
      ocfs2: fix race between crashed dio and rm · ad694821
      Joseph Qi 提交于
      There is a race case between crashed dio and rm, which will lead to
      OCFS2_VALID_FL not set read-only.
      
        N1                              N2
        ------------------------------------------------------------------------
        dd with direct flag
                                        rm file
        crashed with an dio entry left
        in orphan dir
                                        clear OCFS2_VALID_FL in
                                        ocfs2_remove_inode
                                        recover N1 and read the corrupted inode,
                                        and set filesystem read-only
      
      So we skip the inode deletion this time and wait for dio entry recovered
      first.
      Signed-off-by: NJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad694821