1. 02 Sep 2017, 1 commit
  2. 01 Sep 2017, 6 commits
    • include/linux/compiler.h: don't perform compiletime_assert with -O0 · c03567a8
      Committed by Joe Stringer
      Commit c7acec71 ("kernel.h: handle pointers to arrays better in
      container_of()") made use of __compiletime_assert() from
      container_of(), thus increasing the usage of this macro and
      allowing developers to notice type conflicts in container_of()
      usage at compile time.
      
      However, the implementation of __compiletime_assert relies on compiler
      optimizations to report an error.  This means that if a developer uses
      "-O0" with any code that performs container_of(), the compiler will always
      report an error regardless of whether there is an actual problem in the
      code.
      
      This patch disables compiletime_assert when optimizations are
      disabled to allow such code to compile with CFLAGS="-O0".
      
      Example compilation failure:
      
      ./include/linux/compiler.h:547:38: error: call to `__compiletime_assert_94' declared with attribute error: pointer type mismatch in container_of()
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                            ^
      ./include/linux/compiler.h:530:4: note: in definition of macro `__compiletime_assert'
          prefix ## suffix();    \
          ^~~~~~
      ./include/linux/compiler.h:547:2: note: in expansion of macro `_compiletime_assert'
        _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
        ^~~~~~~~~~~~~~~~~~~
      ./include/linux/build_bug.h:46:37: note: in expansion of macro `compiletime_assert'
       #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                           ^~~~~~~~~~~~~~~~~~
      ./include/linux/kernel.h:860:2: note: in expansion of macro `BUILD_BUG_ON_MSG'
        BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
        ^~~~~~~~~~~~~~~~
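
      A minimal sketch of the guard this change describes, assuming the
      no-op path keys off the compiler's __OPTIMIZE__ macro and uses the
      do{}while(0) form mentioned in the akpm note below (the optimized
      branch is illustrative, not the exact compiler.h body):

      	/* Sketch: with -O0 the error-attribute trick cannot fire
      	 * reliably, so the assertion collapses to a no-op. */
      	#ifdef __OPTIMIZE__
      	# define __compiletime_assert(condition, msg, prefix, suffix)	\
      		do {							\
      			extern void prefix ## suffix(void)		\
      				__compiletime_error(msg);		\
      			if (!(condition))				\
      				prefix ## suffix();			\
      		} while (0)
      	#else
      	# define __compiletime_assert(condition, msg, prefix, suffix) \
      		do { } while (0)
      	#endif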
      
      [akpm@linux-foundation.org: use do{}while(0), per Michal]
      Link: http://lkml.kernel.org/r/20170829230114.11662-1-joe@ovn.org
      Fixes: c7acec71 ("kernel.h: handle pointers to arrays better in container_of()")
      Signed-off-by: Joe Stringer <joe@ovn.org>
      Cc: Ian Abbott <abbotti@mev.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: kill invalidate_page · 5f32b265
      Committed by Jérôme Glisse
      The invalidate_page callback suffered from two pitfalls. First, it
      used to happen after the page table lock was released, and thus a
      new page might have been set up before the call to
      invalidate_page() happened.
      
      This was fixed, in a roundabout way, by commit c7ab0d2f ("mm:
      convert try_to_unmap_one() to use page_vma_mapped_walk()"), which
      moved the callback under the page table lock. However, this also
      broke several existing users of the mmu_notifier API that assumed
      they could sleep inside this callback.
      
      The second pitfall was that invalidate_page() was the only
      callback that did not take an address range for the invalidation;
      it was given a single address and page instead. Many of the
      callback implementers assumed this could never be THP and thus
      failed to invalidate the appropriate range for THP.
      
      By killing this callback we unify the mmu_notifier callback API to
      always take a virtual address range as input.
      
      Finally, this also simplifies life for end users, as there are now
      two clear choices (see the sketch below):
        - invalidate_range_start()/end() callbacks (which allow you to sleep)
        - invalidate_range(), which cannot sleep but happens right after
          the page table update, under the page table lock
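
      A hedged sketch of the resulting calling convention (the
      mmu_notifier_* helpers are the real API names; the surrounding
      logic is an illustrative placeholder):

      	/* Sketch of the unified, range-based convention. */
      	mmu_notifier_invalidate_range_start(mm, start, end);	/* may sleep */

      	/* ... take the page table lock, update the page tables ... */
      	mmu_notifier_invalidate_range(mm, start, end);	/* must not sleep */
      	/* ... drop the page table lock ... */

      	mmu_notifier_invalidate_range_end(mm, start, end);	/* may sleep */
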
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: update to new mmu_notifier semantic · a4d1a885
      Committed by Jérôme Glisse
      Replace all mmu_notifier_invalidate_page() calls with
      *_invalidate_range() and make sure each is bracketed by calls to
      *_invalidate_range_start()/end().

      Note that because we cannot presume the pmd or pte value, we have
      to assume the worst and unconditionally report an invalidation as
      happening.
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • <linux/uaccess.h>: Fix copy_in_user() declaration · f58e76c1
      Committed by Bart Van Assche
      copy_in_user() copies data from user-space address @from to user-
      space address @to. Hence declare both @from and @to as user-space
      pointers.
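
      The corrected declaration implied by this would look roughly like
      the following (a sketch; the __always_inline/__must_check
      modifiers are assumptions about the surrounding header):

      	/* Both @to and @from are user-space pointers, so both
      	 * carry the __user annotation. */
      	static __always_inline unsigned long __must_check
      	copy_in_user(void __user *to, const void __user *from,
      		     unsigned long n);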
      
      Fixes: commit d597580d ("generic ...copy_..._user primitives")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • annotate RWF_... flags · ddef7ed2
      Committed by Christoph Hellwig
      [AV: added missing annotations in syscalls.h/compat.h]
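
      The kind of sparse annotation this refers to looks roughly like
      the following sketch (the rwf_t typedef name and the RWF_HIPRI
      example are assumptions, not quotes from the patch):

      	/* __bitwise makes the flag type opaque to sparse, so mixing
      	 * it with plain integers is flagged; __force casts opt out. */
      	typedef int __bitwise rwf_t;

      	#define RWF_HIPRI	((__force rwf_t)0x00000001)
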
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 31 Aug 2017, 14 commits
  4. 30 Aug 2017, 1 commit
  5. 29 Aug 2017, 11 commits
    • Revert "libata: quirk read log on no-name M.2 SSD" · 2aca3923
      Committed by Tejun Heo
      This reverts commit 35f0b6a7.
      
      We now gate issuing of READ LOG PAGE on the TRUSTED COMPUTING
      SUPPORTED bit in the identity data, so this quirk should no longer
      be necessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • libata: check for trusted computing in IDENTIFY DEVICE data · e8f11db9
      Committed by Christoph Hellwig
      ATA-8 and later mirrors the TRUSTED COMPUTING SUPPORTED bit in word 48 of
      the IDENTIFY DEVICE data.  Check this before issuing a READ LOG PAGE
      command to avoid issues with buggy devices.  The only downside is that
      we can't support Security Send / Receive for a device with an older
      revision due to the conflicting use of this field in earlier
      specifications.
      
      tj: The reason we need this is because some devices which don't
          support READ LOG PAGE lock up after getting issued that command.
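
      A hedged sketch of such a check (the helper name and the exact bit
      position are assumptions based on the description above):

      	/* Sketch: report whether the TRUSTED COMPUTING SUPPORTED bit
      	 * mirrored in word 48 of the IDENTIFY DEVICE data is set. */
      	static inline bool ata_id_has_trusted(const u16 *id)
      	{
      		return id[48] & (1 << 0);
      	}
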
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK() · ec81048c
      Committed by Boqun Feng
      In theory, COMPLETION_INITIALIZER_ONSTACK() should never affect the
      stack allocation of the caller. However, on some compilers, a temporary
      structure was allocated for the return value of
      COMPLETION_INITIALIZER_ONSTACK().
      
      For example, in write_journal() with LOCKDEP_COMPLETIONS=y (GCC 7.1.1):
      
      	io_comp.comp = COMPLETION_INITIALIZER_ONSTACK(io_comp.comp);
      	    2462:       e8 00 00 00 00          callq  2467 <write_journal+0x47>
      	    2467:       48 8d 85 80 fd ff ff    lea    -0x280(%rbp),%rax
      	    246e:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi
      	    2475:       48 c7 c2 00 00 00 00    mov    $0x0,%rdx
      		x->done = 0;
      	    247c:       c7 85 90 fd ff ff 00    movl   $0x0,-0x270(%rbp)
      	    2483:       00 00 00
      		init_waitqueue_head(&x->wait);
      	    2486:       48 8d 78 18             lea    0x18(%rax),%rdi
      	    248a:       e8 00 00 00 00          callq  248f <write_journal+0x6f>
      		if (commit_start + commit_sections <= ic->journal_sections) {
      	    248f:       41 8b 87 a8 00 00 00    mov    0xa8(%r15),%eax
      		io_comp.comp = COMPLETION_INITIALIZER_ONSTACK(io_comp.comp);
      	    2496:       48 8d bd e8 f9 ff ff    lea    -0x618(%rbp),%rdi
      	    249d:       48 8d b5 90 fd ff ff    lea    -0x270(%rbp),%rsi
      	    24a4:       b9 17 00 00 00          mov    $0x17,%ecx
      	    24a9:       f3 48 a5                rep movsq %ds:(%rsi),%es:(%rdi)
      		if (commit_start + commit_sections <= ic->journal_sections) {
      	    24ac:       41 39 c6                cmp    %eax,%r14d
      		io_comp.comp = COMPLETION_INITIALIZER_ONSTACK(io_comp.comp);
      	    24af:       48 8d bd 90 fd ff ff    lea    -0x270(%rbp),%rdi
      	    24b6:       48 8d b5 e8 f9 ff ff    lea    -0x618(%rbp),%rsi
      	    24bd:       b9 17 00 00 00          mov    $0x17,%ecx
      	    24c2:       f3 48 a5                rep movsq %ds:(%rsi),%es:(%rdi)
      
      We can clearly see the temporary structure being allocated, and the
      compiler also performs two meaningless memcpys with "rep movsq".
      
      And according to:
      
      	https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html#Statement-Exprs
      
      The return value of a statement expression is returned by value, so
      the temporary variable is created in COMPLETION_INITIALIZER_ONSTACK(),
      and that's why the temporary structures are allocated.
      
      To fix this, make the brace block in COMPLETION_INITIALIZER_ONSTACK()
      return a pointer and dereference it outside the block rather than
      returning the whole structure. This way, we teach the compiler not
      to do the unnecessary stack allocation.
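
      A sketch of that change, reconstructed from the description above
      rather than quoted from the patch:

      	/* Before (sketch): the statement expression yields the whole
      	 * structure by value, so some compilers materialize a
      	 * temporary copy on the stack:
      	 *
      	 *	#define COMPLETION_INITIALIZER_ONSTACK(work) \
      	 *		({ init_completion(&work); work; })
      	 *
      	 * After (sketch): the block yields a pointer that is
      	 * dereferenced outside, avoiding the temporary: */
      	#define COMPLETION_INITIALIZER_ONSTACK(work) \
      		(*({ init_completion(&work); &work; }))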
      
      This can also reduce the stack size even without LOCKDEP. For
      example, in write_journal() compiled with GCC 7.1.1, the result of
      the command:
      
      	 objdump -d drivers/md/dm-integrity.o | ./scripts/checkstack.pl x86
      
      before:
      
      	0x0000246a write_journal [dm-integrity.o]:              696
      
      after:
      
      	0x00002b7a write_journal [dm-integrity.o]:              296
      Reported-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/20170823152542.5150-3-boqun.feng@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • smp: Avoid using two cache lines for struct call_single_data · 966a9671
      Committed by Ying Huang
      struct call_single_data is used in IPIs to transfer information
      between CPUs. Its size is bigger than sizeof(unsigned long) and
      less than the cache line size. Currently it is not allocated with
      any explicit alignment requirement. This makes it possible for an
      allocated call_single_data to cross two cache lines, which doubles
      the number of cache lines that need to be transferred among CPUs.

      This can be fixed by requiring call_single_data to be aligned to
      the size of call_single_data. Currently the size of
      call_single_data is a power of 2. If we add new fields to
      call_single_data, we may need to add padding to make sure the size
      of the new definition is a power of 2 as well.
      
      Fortunately, this is enforced by GCC, which will report bad sizes.
      
      To set the alignment requirement of call_single_data to the size of
      call_single_data, a struct definition and a typedef are used (see
      the sketch below).
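
      A hedged sketch of that struct-plus-typedef pattern (the field
      list and the __call_single_data name are illustrative
      assumptions):

      	/* Keep the struct definition, then give the typedef an
      	 * alignment equal to the structure's own power-of-2 size, so
      	 * an instance can never straddle two cache lines. */
      	struct __call_single_data {
      		struct llist_node llist;
      		smp_call_func_t func;
      		void *info;
      		unsigned int flags;
      	};

      	typedef struct __call_single_data call_single_data_t
      		__aligned(sizeof(struct __call_single_data));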
      
      To test the effect of the patch, I used the vm-scalability
      multiple-thread swap test case (swap-w-seq-mt). The test creates
      multiple threads, and each thread eats memory until all RAM and
      part of swap is used, so that a huge number of IPIs are triggered
      when unmapping memory. In the test, the throughput of memory
      writing improves ~5% compared with misaligned call_single_data,
      because of the faster IPIs.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Huang, Ying <ying.huang@intel.com>
      [ Add call_single_data_t and align with size of call_single_data. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/lockdep: Untangle xhlock history save/restore from task independence · f52be570
      Committed by Peter Zijlstra
      Where XHLOCK_{SOFT,HARD} are save/restore points in xhlocks[] that
      ensure temporal IRQ events don't interact with task state,
      XHLOCK_PROC is a fundamentally different beast that just happens to
      share the interface.
      
      The purpose of XHLOCK_PROC is to annotate independent execution
      inside one task. For example, in workqueues each work item should
      appear to run in its own 'pristine' 'task'.
      
      Remove XHLOCK_PROC in favour of its own interface to avoid confusion.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: boqun.feng@gmail.com
      Cc: david@fromorbit.com
      Cc: johannes@sipsolutions.net
      Cc: kernel-team@lge.com
      Cc: oleg@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/20170829085939.ggmb6xiohw67micb@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR · fc7ce9c7
      Committed by Kan Liang
      For understanding how the workload maps to memory channels and hardware
      behavior, it's very important to collect address maps with physical
      addresses. For example, 3D XPoint access can only be found by filtering
      the physical address.
      
      Add a new sample type for physical address.
      
      perf already has a facility to collect the data virtual address.
      This patch introduces a function to convert the virtual address to
      a physical address. The function is quite generic and can be
      extended to any architecture as long as a virtual address is
      provided.
      
       - For kernel direct mapping addresses, virt_to_phys is used to
         convert the virtual addresses to physical addresses.

       - For user virtual addresses, __get_user_pages_fast is used to
         walk the page tables for the user physical address (see the
         sketch after this list).

       - This does not work for vmalloc addresses right now. These are
         not resolved, but code to do that could be added.
      
      The new sample type requires collecting the virtual address. The
      virtual address will not be output unless SAMPLE_ADDR is applied.
      
      For security, the physical address can only be exposed to root or
      a privileged user.
      Tested-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: mpe@ellerman.id.au
      Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core, pt, bts: Get rid of itrace_started · 8d4e6c4c
      Committed by Alexander Shishkin
      I just noticed that hw.itrace_started and hw.config are aliased to the
      same location. Now, the PT driver happens to use both, which works out
      fine by sheer luck:
      
       - STORE(hw.itrace_started) is ordered before STORE(hw.config) in
         program order, although there are no compiler barriers to ensure
         that,

       - to perf_log_itrace_start(), hw.itrace_started looks set at the
         same time as when it is intended to be set, because both stores
         happen in the same path,
      
       - hw.config is never reset to zero in the PT driver.
      
      Now, the use of hw.config by the PT driver makes more sense (it being a
      HW PMU) than messing around with itrace_started, which is an awkward API
      to begin with.
      
      This patch replaces hw.itrace_started with an attach_state bit and an
      API call for the PMU drivers to use to communicate the condition.
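
      A hedged sketch of that replacement (the bit and helper names are
      assumptions consistent with the summary above):

      	/* An attach_state bit plus a small API call that PMU drivers
      	 * invoke instead of poking hw.itrace_started directly. */
      	#define PERF_ATTACH_ITRACE	0x10

      	static inline void perf_event_itrace_started(struct perf_event *event)
      	{
      		event->attach_state |= PERF_ATTACH_ITRACE;
      	}
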
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: vince@deater.net
      Link: http://lkml.kernel.org/r/20170330153956.25994-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • net/mlx5: Add XRQ support · 5b3ec3fc
      Committed by Artemy Kovalyov
      Add support for the new XRQ (eXtended shared Receive Queue)
      hardware object. It supports SRQ semantics with the addition of
      extended receive buffer topologies and offloads.

      It currently supports the tag matching topology and the rendezvous
      offload.
      Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
      Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • net/mlx5: Update HW layout definitions · 6e44636a
      Committed by Artemy Kovalyov
      * add offload_type field to mlx5_ifc_qpc_bits
      * update mlx5_ifc_xrqc_bits layout
      Signed-off-by: Artemy Kovalyov <artemyko@mellanox.com>
      Reviewed-by: Yossi Itigin <yosefe@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • perf/ftrace: Fix double traces of perf on ftrace:function · 75e83876
      Committed by Zhou Chengming
      When running perf on the ftrace:function tracepoint, there is a bug
      which can be reproduced by:
      
        perf record -e ftrace:function -a sleep 20 &
        perf record -e ftrace:function ls
        perf script
      
        ls 10304 [005]   171.853235: ftrace:function: perf_output_begin
        ls 10304 [005]   171.853237: ftrace:function: perf_output_begin
        ls 10304 [005]   171.853239: ftrace:function: task_tgid_nr_ns
        ls 10304 [005]   171.853240: ftrace:function: task_tgid_nr_ns
        ls 10304 [005]   171.853242: ftrace:function: __task_pid_nr_ns
        ls 10304 [005]   171.853244: ftrace:function: __task_pid_nr_ns
      
      We can see that all the function traces are doubled.
      
      The problem is caused by an inconsistency between the register
      function perf_ftrace_event_register() and the probe function
      perf_ftrace_function_call(). The former registers one probe for
      every perf_event, while the latter handles all perf_events on the
      current CPU. So when there are two perf_events on the current CPU,
      their traces are doubled.
      
      So this patch adds an extra "event" parameter to perf_tp_event(),
      which then sends sample data only to that event when it is not NULL
      (see the sketch below).
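
      A hedged sketch of the resulting signature (the parameters before
      the new trailing argument are assumptions):

      	/* When "event" is non-NULL, sample data is delivered only to
      	 * that event instead of to every event on the hlist. */
      	void perf_tp_event(u16 event_type, u64 count, void *record,
      			   int entry_size, struct pt_regs *regs,
      			   struct hlist_head *head, int rctx,
      			   struct task_struct *task,
      			   struct perf_event *event);
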
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Reviewed-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: huawei.libin@huawei.com
      Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • libata: quirk read log on no-name M.2 SSD · 35f0b6a7
      Committed by Christoph Hellwig
      Ido reported that reading the log page on his systems fails,
      so quirk it as it won't support ZBC or security protocols.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reported-by: Ido Schimmel <idosch@mellanox.com>
      Tested-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  6. 28 Aug 2017, 2 commits
    • dm: fix printk() rate limiting code · 60440789
      Committed by Bart Van Assche
      Using the same rate limiting state for different kinds of messages
      is wrong, because a high-frequency message can then suppress the
      report of a low-frequency message. Hence use a unique rate limiting
      state per message type (see the sketch below).
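
      A hedged sketch of the per-message-type pattern (the DMERR_LIMIT
      name and prefix are assumptions; printk_ratelimited() is the real
      kernel facility):

      	/* printk_ratelimited() embeds one static ratelimit state per
      	 * call site, so each message type is throttled independently
      	 * instead of sharing a single state. */
      	#define DMERR_LIMIT(fmt, ...) \
      		printk_ratelimited(KERN_ERR "device-mapper: " fmt "\n", \
      				   ##__VA_ARGS__)
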
      
      Fixes: 71a16736 ("dm: use local printk ratelimit")
      Cc: stable@vger.kernel.org
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • Clarify (and fix) MAX_LFS_FILESIZE macros · 0cc3b0ec
      Committed by Linus Torvalds
      We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
      filesystems (and other IO targets) that know they are 64-bit clean and
      don't have any 32-bit limits in their IO path.
      
      It turns out that our 32-bit value for that limit was bogus.  On 32-bit,
      the VM layer is limited by the page cache to only 32-bit index values,
      but our logic for that was confusing and actually wrong.  We used to
      define that value to
      
      	(((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)
      
      which is actually odd in several ways: it limits the index to 31 bits,
      and then it limits files so that they can't have data in that last byte
      of a page that has the highest 31-bit index (ie page index 0x7fffffff).
      
      Neither of those limitations makes sense. The index is actually
      the full 32-bit unsigned value, and we can use that whole page. So
      the maximum size of the file would logically be
      "PAGE_SIZE << BITS_PER_LONG".
      
      However, we do want to avoid the maximum index, because we have
      code that iterates over the page indexes, and we don't want that
      code to overflow. So the maximum size of a file on a 32-bit host
      should actually be one page less than the full 32-bit index.
      
      So the actual limit is ULONG_MAX << PAGE_SHIFT.  That means that we will
      not actually be using the page of that last index (ULONG_MAX), but we
      can grow a file up to that limit.
      
      The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
      Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
      volume.  It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
      byte less), but the actual true VM limit is one page less than 16TiB.
      
      This was invisible until commit c2a9737f ("vfs,mm: fix a dead loop
      in truncate_inode_pages_range()"), which started applying that
      MAX_LFS_FILESIZE limit to block devices too.
      
      NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
      actually just the offset type itself (loff_t), which is signed.  But for
      clarity, on 64-bit, just use the maximum signed value, and don't make
      people have to count the number of 'f' characters in the hex constant.
      
      So just use LLONG_MAX for the 64-bit case.  That was what the value had
      been before too, just written out as a hex constant.
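
      Put together, the definitions described above amount to roughly
      this (a sketch reconstructed from the text, not quoted from the
      patch):

      	/* 32-bit: one page short of the full 32-bit page index;
      	 * 64-bit: simply the maximum signed offset. */
      	#if BITS_PER_LONG == 32
      	# define MAX_LFS_FILESIZE	((loff_t)ULONG_MAX << PAGE_SHIFT)
      	#else
      	# define MAX_LFS_FILESIZE	((loff_t)LLONG_MAX)
      	#endif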
      
      Fixes: c2a9737f ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
      Reported-and-tested-by: Doug Nazar <nazard@nazar.ca>
      Cc: Andreas Dilger <adilger@dilger.ca>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 25 Aug 2017, 5 commits