1. 18 3月, 2014 2 次提交
  2. 27 2月, 2014 1 次提交
  3. 13 2月, 2014 2 次提交
    • L
      ACPICA: acpidump: Remove integer types translation protection. · e252652f
      Lv Zheng 提交于
      Remove translation protection for applications as Linux tools folder will
      start to use such types.
      
      In Linux kernel source tree, after removing this translation protection,
      the u8/u16/u32/u64/s32/s64 typedefs are exposed for both __KERNEL__ builds
      and !__KERNEL__ builds (tools/power/acpi) and the original definitions of
      ACPI_UINT8/16/32/64_MAX are changed.
      
      For !__KERNEL__ builds, this kind of defintions should already been tested
      by the distribution vendors that are distributing binary ACPICA package and
      we've achieved the successful built/run test result in the kernel source
      tree.
      
      For __KERNEL__ builds, there are 2 things affected:
      1. u8/u16/u32/u64/s32/s64 type definitions:
         Since Linux has already type defined u8/u16/u32/u64/s32/s64 in
         include/uapi/asm-generic/int-ll64.h for __KERNEL__.  In order not to
         introduce build regressions where the 2 typedefs are differed,
         ACPI_USE_SYSTEM_INTTYPES is introduced to mask out ACPICA's typedefs.
         It must be defined for Linux __KERNEL__ builds.
      2. ACPI_UINT8/16/32/64_MAX definitions:
         Before applying this change:
           ACPI_UINT8_MAX: sizeof (UINT8)
            UINT8: unsigned char
           ACPI_UINT16_MAX: sizeof (UINT16)
            UINT16: unsigned short
           ACPI_UINT32_MAX: sizeof (UINT32)
            INT32: int
            UINT32: unsigned int
           ACPI_UINT64_MAX: sizeof (UINT64)
            INT64: COMPILER_DEPENDENT_INT64
             COMPILER_DEPENDENT_INT64: signed long (IA64) or
                                       signed long long (IA32)
            UINT64: COMPILER_DEPENDENT_UINT64
             COMPILER_DEPENDENT_UINT64: unsigned long (IA64) or
                                        unsigned long long (IA32)
         After applying this change:
           ACPI_UINT8_MAX: sizeof (u8)
            u8: unsigned char
            UINT8: (removed from actypes.h)
           ACPI_UINT16_MAX: sizeof (u16)
            u16: unsigned short
            UINT16: (removed from actypes.h)
           ACPI_UINT32_MAX: sizeof (u32)
            INT32/UINT32: (removed from actypes.h)
            s32: signed int
            u32: unsigned int
           ACPI_UINT64_MAX: sizeof (u64)
            INT64/UINT64: (removed from actypes.h)
            u64: unsigned long long
            s64: signed long long
            COMPILER_DEPENDENT_INT64: signed long (IA64) (not used any more)
                                      signed long long (IA32) (not used any more)
            COMPILER_DEPENDENT_UINT64: unsigned long (IA64) (not used any more)
                                       unsigned long long (IA32) (not used any more)
         All definitions are equal except ACPI_UINT64_MAX for CONFIG_IA64.  It
         is changed from sizeof(unsigned long) to sizeof(unsigned long long).
         By investigation, 64bit Linux kernel build is LP64 compliant, i.e.,
         sizeof(long) and (pointer) are 64.  As sizeof(unsigned long) equals to
         sizeof(unsigned long long) on IA64 platform where CONFIG_64BIT cannot be
         disabled, this change actually will not affect the value of
         ACPI_UINT64_MAX on IA64 platforms.
      
      This patch is necessary for the ACPICA's acpidump tool to build
      correctly.  Lv Zheng.
      Signed-off-by: NLv Zheng <lv.zheng@intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      e252652f
    • L
      ACPICA: acpidump: Add sparse declarators support. · 7e66b46b
      Lv Zheng 提交于
      Linux kernel resident ACPICA headers include some sparse declarators for
      kernel static checkers.  This patch adds code to disable them for non
      __KERNEL__ defined code so that it is possible for the ACPICA user space
      tool's source files to be built with Linux kernel ACPICA header files
      included.  Lv Zheng.
      
      Linux kernel build is not affected by this commit.
      Signed-off-by: NLv Zheng <lv.zheng@intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      7e66b46b
  4. 11 2月, 2014 3 次提交
  5. 10 2月, 2014 1 次提交
    • A
      fix O_SYNC|O_APPEND syncing the wrong range on write() · d311d79d
      Al Viro 提交于
      It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
      when sync_page_range() had been introduced; generic_file_write{,v}() correctly
      synced
      	pos_after_write - written .. pos_after_write - 1
      but generic_file_aio_write() synced
      	pos_before_write .. pos_before_write + written - 1
      instead.  Which is not the same thing with O_APPEND, obviously.
      A couple of years later correct variant had been killed off when
      everything switched to use of generic_file_aio_write().
      
      All users of generic_file_aio_write() are affected, and the same bug
      has been copied into other instances of ->aio_write().
      
      The fix is trivial; the only subtle point is that generic_write_sync()
      ought to be inlined to avoid calculations useless for the majority of
      calls.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d311d79d
  6. 07 2月, 2014 1 次提交
    • S
      swap: add a simple detector for inappropriate swapin readahead · 579f8290
      Shaohua Li 提交于
      This is a patch to improve swap readahead algorithm.  It's from Hugh and
      I slightly changed it.
      
      Hugh's original changelog:
      
      swapin readahead does a blind readahead, whether or not the swapin is
      sequential.  This may be ok on harddisk, because large reads have
      relatively small costs, and if the readahead pages are unneeded they can
      be reclaimed easily - though, what if their allocation forced reclaim of
      useful pages? But on SSD devices large reads are more expensive than
      small ones: if the readahead pages are unneeded, reading them in caused
      significant overhead.
      
      This patch adds very simplistic random read detection.  Stealing the
      PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
      vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
      simply looks at readahead's current success rate, and narrows or widens
      its readahead window accordingly.  There is little science to its
      heuristic: it's about as stupid as can be whilst remaining effective.
      
      The table below shows elapsed times (in centiseconds) when running a
      single repetitive swapping load across a 1000MB mapping in 900MB ram
      with 1GB swap (the harddisk tests had taken painfully too long when I
      used mem=500M, but SSD shows similar results for that).
      
      Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
      Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
      which Shaohua showed to be defective; HughNew this Nov 14 patch, with
      page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
      patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
      (1-page reads: no readahead).
      
      HDD for swapping to harddisk, SSD for swapping to VertexII SSD.  Seq for
      sequential access to the mapping, cycling five times around; Rand for
      the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
      Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
      
      One weakness of Shaohua's vma/anon_vma approach was that it did not
      optimize Shmem: seen below.  Konstantin's approach was perhaps mistuned,
      50% slower on Seq: did not compete and is not shown below.
      
      HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
      Seq Anon     73921   76210   75611   76904   78191  121542
      Seq Shmem    73601   73176   73855   72947   74543  118322
      Rand Anon   895392  831243  871569  845197  846496  841680
      Rand Shmem 1058375 1053486  827935  764955  764376  756489
      
      SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
      Seq Anon     24634   24198   24673   25107   21614   70018
      Seq Shmem    24959   24932   25052   25703   22030   69678
      Rand Anon    43014   26146   28075   25989   26935   25901
      Rand Shmem   45349   45215   28249   24268   24138   24332
      
      These tests are, of course, two extremes of a very simple case: under
      heavier mixed loads I've not yet observed any consistent improvement or
      degradation, and wider testing would be welcome.
      
      Shaohua Li:
      
      Test shows Vanilla is slightly better in sequential workload than Hugh's
      patch.  I observed with Hugh's patch sometimes the readahead size is
      shrinked too fast (from 8 to 1 immediately) in sequential workload if
      there is no hit.  And in such case, continuing doing readahead is good
      actually.
      
      I don't prepare a sophisticated algorithm for the sequential workload
      because so far we can't guarantee sequential accessed pages are swap out
      sequentially.  So I slightly change Hugh's heuristic - don't shrink
      readahead size too fast.
      
      Here is my test result (unit second, 3 runs average):
      	Vanilla		Hugh		New
      Seq	356		370		360
      Random	4525		2447		2444
      
      Attached graph is the swapin/swapout throughput I collected with 'vmstat
      2'.  The first part is running a random workload (till around 1200 of
      the x-axis) and the second part is running a sequential workload.
      swapin and swapout throughput are almost identical in steady state in
      both workloads.  These are expected behavior.  while in Vanilla, swapin
      is much bigger than swapout especially in random workload (because wrong
      readahead).
      
      Original patches by: Shaohua Li and Konstantin Khlebnikov.
      
      [fengguang.wu@intel.com: swapin_nr_pages() can be static]
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      579f8290
  7. 06 2月, 2014 1 次提交
    • L
      execve: use 'struct filename *' for executable name passing · c4ad8f98
      Linus Torvalds 提交于
      This changes 'do_execve()' to get the executable name as a 'struct
      filename', and to free it when it is done.  This is what the normal
      users want, and it simplifies and streamlines their error handling.
      
      The controlled lifetime of the executable name also fixes a
      use-after-free problem with the trace_sched_process_exec tracepoint: the
      lifetime of the passed-in string for kernel users was not at all
      obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
      the pathname allocation lifetime with the execve() having finished,
      which in turn meant that the trace point that happened after
      mm_release() of the old process VM ended up using already free'd memory.
      
      To solve the kernel string lifetime issue, this simply introduces
      "getname_kernel()" that works like the normal user-space getname()
      function, except with the source coming from kernel memory.
      
      As Oleg points out, this also means that we could drop the tcomm[] array
      from 'struct linux_binprm', since the pathname lifetime now covers
      setup_new_exec().  That would be a separate cleanup.
      Reported-by: NIgor Zhbanov <i.zhbanov@samsung.com>
      Tested-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4ad8f98
  8. 05 2月, 2014 1 次提交
    • A
      ACPI: introduce CONFIG_ACPI_REDUCED_HARDWARE_ONLY · af1ae78a
      Al Stone 提交于
      ACPI hardware reduced mode exists to allow newer platforms to use a
      simpler form of ACPI that does not require supporting legacy versions
      of the specification and their associated hardware.  This mode was
      introduced in the ACPI 5.0 specification.
      
      The ACPI hardware reduced mode is supposed to be used on systems
      having the HW_REDUCED_ACPI flag set in the FADT.  ACPICA checks
      that flag to determine whether or not it should work in the HW
      reduced mode and there are pieces of code in it that will never
      be used in that case.
      
      Since some architecutres will always use the ACPI HW reduced mode,
      it doesn't make sense for them to ever compile support for anything
      else.  Thus, they should set the flag ACPI_REDUCED_HARDWARE to TRUE
      in the ACPICA source.  To enable them to do that, introduce a new
      kernel configuration option, CONFIG_ACPI_REDUCED_HARDWARE_ONLY, that
      will cause the ACPICA's ACPI_REDUCED_HARDWARE flag to be TRUE when
      set.
      
      Introducing this configuration item is based on suggestions from Lv
      Zheng saying that this does not belong in ACPICA, but rather to the
      Linux kernel itself.
      
      References: http://www.spinics.net/lists/linux-acpi/msg46369.htmlSigned-off-by: NHanjun Guo <hanjun.guo@linaro.org>
      Signed-off-by: NAl Stone <al.stone@linaro.org>
      [rjw: Subject and changelog]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      af1ae78a
  9. 03 2月, 2014 1 次提交
  10. 02 2月, 2014 1 次提交
  11. 01 2月, 2014 1 次提交
  12. 31 1月, 2014 6 次提交
    • Z
      xen/grant-table: Avoid m2p_override during mapping · 08ece5bb
      Zoltan Kiss 提交于
      The grant mapping API does m2p_override unnecessarily: only gntdev needs it,
      for blkback and future netback patches it just cause a lock contention, as
      those pages never go to userspace. Therefore this series does the following:
      - the original functions were renamed to __gnttab_[un]map_refs, with a new
        parameter m2p_override
      - based on m2p_override either they follow the original behaviour, or just set
        the private flag and call set_phys_to_machine
      - gnttab_[un]map_refs are now a wrapper to call __gnttab_[un]map_refs with
        m2p_override false
      - a new function gnttab_[un]map_refs_userspace provides the old behaviour
      
      It also removes a stray space from page.h and change ret to 0 if
      XENFEAT_auto_translated_physmap, as that is the only possible return value
      there.
      
      v2:
      - move the storing of the old mfn in page->index to gnttab_map_refs
      - move the function header update to a separate patch
      
      v3:
      - a new approach to retain old behaviour where it needed
      - squash the patches into one
      
      v4:
      - move out the common bits from m2p* functions, and pass pfn/mfn as parameter
      - clear page->private before doing anything with the page, so m2p_find_override
        won't race with this
      
      v5:
      - change return value handling in __gnttab_[un]map_refs
      - remove a stray space in page.h
      - add detail why ret = 0 now at some places
      
      v6:
      - don't pass pfn to m2p* functions, just get it locally
      Signed-off-by: NZoltan Kiss <zoltan.kiss@citrix.com>
      Suggested-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      08ece5bb
    • D
      mm: sl[uo]b: fix misleading comments · 433a91ff
      Dave Hansen 提交于
      On x86, SLUB creates and handles <=8192-byte allocations internally.
      It passes larger ones up to the allocator.  Saying "up to order 2" is,
      at best, ambiguous.  Is that order-1?  Or (order-2 bytes)?  Make
      it more clear.
      
      SLOB commits a similar sin.  It *handles* page-size requests, but the
      comment says that it passes up "all page size and larger requests".
      
      SLOB also swaps around the order of the very-similarly-named
      KMALLOC_SHIFT_HIGH and KMALLOC_SHIFT_MAX #defines.  Make it
      consistent with the order of the other two allocators.
      
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      433a91ff
    • M
      zsmalloc: add copyright · 31fc00bb
      Minchan Kim 提交于
      Add my copyright to the zsmalloc source code which I maintain.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31fc00bb
    • M
      zsmalloc: move it under mm · bcf1647d
      Minchan Kim 提交于
      This patch moves zsmalloc under mm directory.
      
      Before that, description will explain why we have needed custom
      allocator.
      
      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and high allocation success
      rate on large object, but <= PAGE_SIZE allocations.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for higher allocation success rate under memory
      pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the zspage.
      This allows for lower fragmentation than could be had with the kernel
      slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
      kernel slab allocator, if a page compresses to 60% of it original size,
      the memory savings gained through compression is lost in fragmentation
      because another object of the same size can't be stored in the leftover
      space.
      
      This ability to span pages results in zsmalloc allocations not being
      directly addressable by the user.  The user is given an
      non-dereferencable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer to
      the mapped region that can be used.  The mapping is necessary since the
      object data may reside in two different noncontigious pages.
      
      The zsmalloc fulfills the allocation needs for zram perfectly
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NNitin Gupta <ngupta@vflare.org>
      Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bcf1647d
    • C
      kernel: use lockless list for smp_call_function_single · 6897fc22
      Christoph Hellwig 提交于
      Make smp_call_function_single and friends more efficient by using a
      lockless list.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6897fc22
    • Y
      memblock, bootmem: restore goal for alloc_low · 07bacb38
      Yinghai Lu 提交于
      Now we have memblock_virt_alloc_low to replace original bootmem api in
      swiotlb.
      
      But we should not use BOOTMEM_LOW_LIMIT for arch that does not support
      CONFIG_NOBOOTMEM, as old api take 0.
      
      | #define alloc_bootmem_low(x) \
      |        __alloc_bootmem_low(x, SMP_CACHE_BYTES, 0)
      |#define alloc_bootmem_low_pages_nopanic(x) \
      |        __alloc_bootmem_low_nopanic(x, PAGE_SIZE, 0)
      
      and we have
       #define BOOTMEM_LOW_LIMIT __pa(MAX_DMA_ADDRESS)
      for CONFIG_NOBOOTMEM.
      
      Restore goal to 0 to fix ia64 crash, that Tony found.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Reported-by: NTony Luck <tony.luck@gmail.com>
      Tested-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07bacb38
  13. 30 1月, 2014 6 次提交
  14. 29 1月, 2014 7 次提交
    • J
      fsnotify: Do not return merged event from fsnotify_add_notify_event() · 83c0e1b4
      Jan Kara 提交于
      The event returned from fsnotify_add_notify_event() cannot ever be used
      safely as the event may be freed by the time the function returns (after
      dropping notification_mutex). So change the prototype to just return
      whether the event was added or merged into some existing event.
      Reported-and-tested-by: NJiri Kosina <jkosina@suse.cz>
      Reported-and-tested-by: NDave Jones <davej@fedoraproject.org>
      Signed-off-by: NJan Kara <jack@suse.cz>
      83c0e1b4
    • F
      Btrfs: add support for inode properties · 63541927
      Filipe David Borba Manana 提交于
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      The unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated to it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property, have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essencialy because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way such
         as we do for dir intex items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      63541927
    • J
      rwsem: add rwsem_is_contended · 4a444b1f
      Josef Bacik 提交于
      Btrfs needs a simple way to know if it needs to let go of it's read lock on a
      rwsem.  Introduce rwsem_is_contended to check to see if there are any waiters on
      this rwsem currently.  This is just a hueristic, it is meant to be light and not
      100% accurate and called by somebody already holding on to the rwsem in either
      read or write.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      4a444b1f
    • L
      Btrfs/tracepoint: update new flags for ordered extent TP · 792ddef0
      Liu Bo 提交于
      Flag BTRFS_ORDERED_TRUNCATED is a new one, update the tracepoint to
      support it.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      792ddef0
    • L
      Btrfs/tracepoint: fix to report right flags for ordered extent · 9d04a8ce
      Liu Bo 提交于
      We use set_bit() to assign ordered extent's flags, but in the related
      tracepoint we don't do the same thing, which makes the trace output
      not to parse flags correctly.
      
      Also, since the flags are bits stuff, we change to use __print_flags with
      a 'delim' instead of __print_symbolic.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9d04a8ce
    • J
      btrfs: add ioctl to export size of global metadata reservation · 01e219e8
      Jeff Mahoney 提交于
      btrfs filesystem df output will show the size of the metadata space
      and how much of it is used, and the user assumes that the difference
      is all usable space. Since that's not actually the case due to the
      global metadata reservation, we should provide the full picture to the
      user.
      
      This patch adds an ioctl that exports the size of the global metadata
      reservation so that btrfs filesystem df can report it.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      01e219e8
    • J
      btrfs: add ioctls to query/change feature bits online · 2eaa055f
      Jeff Mahoney 提交于
      There are some feature bits that require no offline setup and can
      be enabled online. I've only reviewed extended irefs, but there will
      probably be more.
      
      We introduce three new ioctls:
      - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
      - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
        basis, as well as querying for which features are changeable with mounted.
      - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.
      
      We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
      allow us to define which features are safe to change at runtime.
      
      The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
      - Enabling a completely unsupported feature: warns and returns -ENOTSUPP
      - Enabling a feature that can only be done offline: warns and returns -EPERM
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2eaa055f
  15. 28 1月, 2014 6 次提交