1. 11 Feb 2014 · 1 commit
  2. 10 Feb 2014 · 7 commits
  3. 07 Feb 2014 · 2 commits
    • inet: defines IPPROTO_* needed for module alias generation · ee262ad8
      By Jan Moskyto Matejka
      Commit cfd280c9 ("net: sync some IP headers with glibc") changed a set
      of #defines to an enum (with no explanation why), which introduced a bug
      in the mip6 module, where aliases are generated using the IPPROTO_*
      defines; mip6 doesn't load if require_module() is called with the
      aliases from xfrm_get_type().

      Revert this change back to #defines to fix the aliases; the sketch after
      this entry shows why the enum breaks alias generation.
      
      modinfo mip6 (before this change)
      alias:          xfrm-type-10-IPPROTO_DSTOPTS
      alias:          xfrm-type-10-IPPROTO_ROUTING
      
      modinfo mip6 (after this change)
      alias:          xfrm-type-10-43
      alias:          xfrm-type-10-60
      Signed-off-by: Jan Moskyto Matejka <mq@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee262ad8
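      The mechanism behind the broken aliases is preprocessor stringification:
      module-alias macros stringify their argument, so a #define expands to its
      number while an enum constant leaks through as a literal token. A
      standalone sketch (the STR/XSTR helper names are illustrative, not the
      kernel's):

        #include <stdio.h>

        #define STR(x)  #x
        #define XSTR(x) STR(x)           /* expand x first, then stringify */

        #define IPPROTO_ROUTING 43       /* as a #define */
        enum { IPPROTO_DSTOPTS = 60 };   /* as an enum constant */

        int main(void)
        {
                /* #define: expands before stringification -> "43" */
                printf("alias: xfrm-type-10-%s\n", XSTR(IPPROTO_ROUTING));
                /* enum: nothing to expand -> "IPPROTO_DSTOPTS" */
                printf("alias: xfrm-type-10-%s\n", XSTR(IPPROTO_DSTOPTS));
                return 0;
        }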
    • swap: add a simple detector for inappropriate swapin readahead · 579f8290
      By Shaohua Li
      This patch improves the swap readahead algorithm.  It's from Hugh, and I
      changed it slightly.
      
      Hugh's original changelog:
      
      swapin readahead does a blind readahead, whether or not the swapin is
      sequential.  This may be OK on a hard disk, because large reads have
      relatively small costs, and if the readahead pages are unneeded they can
      be reclaimed easily - though, what if their allocation forced reclaim of
      useful pages? But on SSD devices large reads are more expensive than
      small ones: if the readahead pages are unneeded, reading them in causes
      significant overhead.
      
      This patch adds very simplistic random read detection.  Stealing the
      PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
      vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
      simply looks at readahead's current success rate, and narrows or widens
      its readahead window accordingly.  There is little science to its
      heuristic: it's about as stupid as can be whilst remaining effective.
      
      The table below shows elapsed times (in centiseconds) when running a
      single repetitive swapping load across a 1000MB mapping in 900MB ram
      with 1GB swap (the harddisk tests took painfully long when I used
      mem=500M, but SSD shows similar results for that).
      
      Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
      Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
      which Shaohua showed to be defective; HughNew this Nov 14 patch, with
      page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
      patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
      (1-page reads: no readahead).
      
      HDD for swapping to harddisk, SSD for swapping to VertexII SSD.  Seq for
      sequential access to the mapping, cycling five times around; Rand for
      the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
      Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
      
      One weakness of Shaohua's vma/anon_vma approach was that it did not
      optimize Shmem: seen below.  Konstantin's approach was perhaps mistuned,
      50% slower on Seq: did not compete and is not shown below.
      
      HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
      Seq Anon     73921   76210   75611   76904   78191  121542
      Seq Shmem    73601   73176   73855   72947   74543  118322
      Rand Anon   895392  831243  871569  845197  846496  841680
      Rand Shmem 1058375 1053486  827935  764955  764376  756489
      
      SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
      Seq Anon     24634   24198   24673   25107   21614   70018
      Seq Shmem    24959   24932   25052   25703   22030   69678
      Rand Anon    43014   26146   28075   25989   26935   25901
      Rand Shmem   45349   45215   28249   24268   24138   24332
      
      These tests are, of course, two extremes of a very simple case: under
      heavier mixed loads I've not yet observed any consistent improvement or
      degradation, and wider testing would be welcome.
      
      Shaohua Li:
      
      Testing shows Vanilla is slightly better than Hugh's patch in the
      sequential workload.  I observed that with Hugh's patch the readahead
      size sometimes shrinks too fast (from 8 to 1 immediately) in the
      sequential workload if there is no hit.  In such a case, continuing to
      do readahead is actually good.
      
      I didn't prepare a sophisticated algorithm for the sequential workload
      because so far we can't guarantee sequentially accessed pages are
      swapped out sequentially.  So I slightly changed Hugh's heuristic: don't
      shrink the readahead size too fast.
      
      Here is my test result (unit: seconds, average of 3 runs):
      	Vanilla		Hugh		New
      Seq	356		370		360
      Random	4525		2447		2444
      
      The attached graph shows the swapin/swapout throughput I collected with
      'vmstat 2'.  The first part runs a random workload (until around 1200 on
      the x-axis) and the second part runs a sequential workload.  With this
      patch, swapin and swapout throughput are almost identical in steady
      state in both workloads, which is the expected behavior, while in
      Vanilla swapin is much bigger than swapout, especially in the random
      workload (because of wrong readahead).
      
      Original patches by: Shaohua Li and Konstantin Khlebnikov.
      
      [fengguang.wu@intel.com: swapin_nr_pages() can be static]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      579f8290
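      A minimal sketch of the shape of this heuristic (hypothetical names and
      a simplified shrink rule, not the kernel's exact swapin_nr_pages()):
      widen the window when the last readahead produced hits, shrink it
      gradually when it did not, and clamp it to [1, 2^page_cluster].

        /* Hypothetical sketch of an adaptive swapin readahead window. */
        static unsigned int last_window = 1;

        unsigned int swapin_window(int recent_hits, unsigned int page_cluster)
        {
                unsigned int max = 1U << page_cluster;  /* 8 pages at default 3 */
                unsigned int window = last_window;

                if (recent_hits)
                        window *= 2;    /* readahead paid off: widen */
                else if (window > 1)
                        window /= 2;    /* no hits: shrink, but not straight to 1 */

                if (window > max)
                        window = max;
                last_window = window;
                return window;
        }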
  4. 06 Feb 2014 · 6 commits
    • netfilter: nf_tables: fix racy rule deletion · 0165d932
      By Pablo Neira Ayuso
      We may lose a race if we flush the rule-set (which happens asynchronously
      via call_rcu) and then try to remove the table (which userspace assumes
      to be empty).

      Fix this by restoring synchronous rule and chain deletion. Asynchronous
      deletion was introduced some time ago, before we had batch support, when
      synchronous rule deletion performance was not good. Now that we have
      batch support, we can just postpone the purge of the old rules to a
      second step in the commit phase (see the sketch after this entry). All
      object deletions are synchronous after this patch.

      As a side effect, we save memory, as we no longer need an rcu_head per
      rule.
      
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0165d932
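      A hedged sketch of that two-step pattern (illustrative names and list
      members, not nf_tables' actual code): unlink rules during the
      transaction, then let one grace period in the commit phase cover all of
      them before purging synchronously, so no rcu_head per rule is needed.

        /* Transaction phase: unlink each deleted rule from its RCU-visible
         * chain and collect it on a local purge list (second list node). */
        list_del_rcu(&rule->list);
        list_add(&rule->purge_list, &to_purge);

        /* Commit phase: a single grace period covers every pending deletion... */
        synchronize_rcu();

        /* ...after which the old rules can be freed synchronously. */
        list_for_each_entry_safe(rule, next, &to_purge, purge_list)
                kfree(rule);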
    • netfilter: nf_tables: add reject module for NFPROTO_INET · 05513e9e
      By Patrick McHardy
      Add a reject module for NFPROTO_INET. It does nothing but dispatch
      to the AF-specific modules based on the hook family.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      05513e9e
    • netfilter: nft_reject: split up reject module into IPv4 and IPv6 specific parts · cc4723ca
      By Patrick McHardy
      Currently the nft_reject module depends on symbols from ipv6. This is
      wrong since no generic module should force IPv6 support to be loaded.
      Split the module up into AF-specific parts and a generic part.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      cc4723ca
    • netfilter: nf_tables: add AF specific expression support · 64d46806
      By Patrick McHardy
      For the reject module, we need to add AF-specific implementations to
      get rid of incorrect module dependencies. Try to load an AF-specific
      module first and fall back to the generic module (see the sketch after
      this entry).
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      64d46806
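      The load order described above might look roughly like this (the alias
      format strings and lookup helper are illustrative assumptions, not the
      exact nft aliases):

        /* Try the AF-specific expression module first... */
        request_module("nft-expr-%u-%.*s", family, name_len, name);
        type = expr_type_lookup(family, name);
        if (type != NULL)
                return type;

        /* ...then fall back to the generic module. */
        request_module("nft-expr-%.*s", name_len, name);
        return expr_type_lookup(NFPROTO_UNSPEC, name);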
    • execve: use 'struct filename *' for executable name passing · c4ad8f98
      By Linus Torvalds
      This changes 'do_execve()' to get the executable name as a 'struct
      filename', and to free it when it is done.  This is what the normal
      users want, and it simplifies and streamlines their error handling.
      
      The controlled lifetime of the executable name also fixes a
      use-after-free problem with the trace_sched_process_exec tracepoint: the
      lifetime of the passed-in string for kernel users was not at all
      obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
      the pathname allocation lifetime with the execve() having finished,
      which in turn meant that the tracepoint that fired after mm_release() of
      the old process VM ended up using already-freed memory.
      
      To solve the kernel string lifetime issue, this simply introduces
      "getname_kernel()" that works like the normal user-space getname()
      function, except with the source coming from kernel memory.
      
      As Oleg points out, this also means that we could drop the tcomm[] array
      from 'struct linux_binprm', since the pathname lifetime now covers
      setup_new_exec().  That would be a separate cleanup.
      Reported-by: Igor Zhbanov <i.zhbanov@samsung.com>
      Tested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4ad8f98
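      The resulting calling convention, sketched (simplified; getname_kernel()
      mirrors the userspace getname() but copies from kernel memory, and
      do_execve() now owns and frees the name):

        struct filename *name;

        name = getname_kernel(kernel_path);   /* controlled lifetime starts */
        if (IS_ERR(name))
                return PTR_ERR(name);

        return do_execve(name, argv, envp);   /* do_execve() frees it when done */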
    • netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt · e53376be
      By Pablo Neira Ayuso
      With this patch, the conntrack refcount is initially set to zero and is
      bumped once the conntrack is added to any of the lists, so we fulfill
      Eric's golden rule: all released objects always have a refcount of zero.

      Andrey Vagin reports that nf_conntrack_free() can't be called for a
      conntrack with a non-zero ref-counter, because it can race with
      nf_conntrack_find_get().

      A conntrack slab is created with SLAB_DESTROY_BY_RCU. A non-zero
      ref-counter says that the conntrack is in use. So when we release a
      conntrack with a non-zero counter, we break this assumption.
      
      CPU1                                    CPU2
      ____nf_conntrack_find()
                                              nf_ct_put()
                                               destroy_conntrack()
                                              ...
                                              init_conntrack
                                               __nf_conntrack_alloc (set use = 1)
      atomic_inc_not_zero(&ct->use) (use = 2)
                                               if (!l4proto->new(ct, skb, dataoff, timeouts))
                                                nf_conntrack_free(ct); (use = 2 !!!)
                                              ...
                                              __nf_conntrack_alloc (set use = 1)
       if (!nf_ct_key_equal(h, tuple, zone))
        nf_ct_put(ct); (use = 0)
         destroy_conntrack()
                                              /* continue to work with CT */
      
      After applying the patch "[PATCH] netfilter: nf_conntrack: fix RCU
      race in nf_conntrack_find_get", another bug was triggered in
      destroy_conntrack():
      
      <4>[67096.759334] ------------[ cut here ]------------
      <2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
      ...
      <4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G         C ---------------    2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
      <4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>]  [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
      <4>[67096.760255] Call Trace:
      <4>[67096.760255]  [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
      <4>[67096.760255]  [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
      <4>[67096.760255]  [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
      <4>[67096.760255]  [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
      <4>[67096.760255]  [<ffffffff81484419>] nf_iterate+0x69/0xb0
      <4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
      <4>[67096.760255]  [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
      <4>[67096.760255]  [<ffffffff814b5b00>] ? dst_output+0x0/0x20
      <4>[67096.760255]  [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
      <4>[67096.760255]  [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
      <4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
      <4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
      <4>[67096.760255]  [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
      <4>[67096.760255]  [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
      <4>[67096.760255]  [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
      <4>[67096.760255]  [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
      <4>[67096.760255]  [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
      <4>[67096.760255]  [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
      <4>[67096.760255]  [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
      <4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
      <4>[67096.760255]  [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
      <4>[67096.760255]  [<ffffffff814457c9>] sys_sendto+0x139/0x190
      <4>[67096.760255]  [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
      <4>[67096.760255]  [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
      <4>[67096.760255]  [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
      <4>[67096.760255]  [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
      
      I have reused the original title for the RFC patch that Andrey posted and
      most of the original patch description.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Andrew Vagin <avagin@parallels.com>
      Cc: Florian Westphal <fw@strlen.de>
      Reported-by: Andrew Vagin <avagin@parallels.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Andrew Vagin <avagin@parallels.com>
      e53376be
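      The golden rule and the lookup it protects, in sketch form (simplified,
      with illustrative member and helper names): objects in a
      SLAB_DESTROY_BY_RCU cache can be recycled while a reader still holds an
      RCU reference, so lookups must use atomic_inc_not_zero() and only
      published objects may carry a non-zero refcount.

        /* Allocation: the refcount stays zero until the object is published. */
        ct = kmem_cache_alloc(cache, gfp);
        atomic_set(&ct->use, 0);
        /* ... initialize the conntrack ... */
        atomic_set(&ct->use, 1);                  /* bump only on insertion */
        hlist_nulls_add_head_rcu(&ct->node, head);

        /* Lookup: a zero refcount marks a dying or unpublished object. */
        rcu_read_lock();
        ct = find_in_hash(key);
        if (ct && !atomic_inc_not_zero(&ct->use))
                ct = NULL;                        /* lost the race; retry */
        rcu_read_unlock();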
  5. 05 Feb 2014 · 1 commit
    • net: ethoc: set up MII management bus clock · a13aff06
      By Max Filippov
      The MII management bus clock is derived from the MAC clock by dividing
      it by the MIIMODER register's CLKDIV field value. This value may need to
      be set up in case it is undefined, or if its default value is too high
      (making communication with the PHY too slow) or too low (making
      communication with the PHY impossible). The value of CLKDIV is not
      specified directly, but is derived from the MAC clock for the default
      MII management bus frequency of 2.5 MHz. The MAC clock may be specified
      in the platform data or in the 'clocks' device tree attribute.
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a13aff06
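      As a worked example of that derivation (a sketch for the stated 2.5 MHz
      target, not the driver's exact code), the divider just has to keep the
      bus at or below the target frequency:

        /* e.g. a 100 MHz MAC clock needs CLKDIV >= 40 for a 2.5 MHz bus. */
        u32 clkdiv = DIV_ROUND_UP(mac_clk_hz, 2500000);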
  6. 03 Feb 2014 · 1 commit
  7. 02 Feb 2014 · 1 commit
  8. 01 Feb 2014 · 1 commit
  9. 31 Jan 2014 · 7 commits
    • xen/grant-table: Avoid m2p_override during mapping · 08ece5bb
      By Zoltan Kiss
      The grant mapping API does m2p_override unnecessarily: only gntdev needs
      it; for blkback and the future netback patches it just causes lock
      contention, as those pages never go to userspace. Therefore this series
      does the following:
      - the original functions were renamed to __gnttab_[un]map_refs, with a new
        parameter m2p_override
      - based on m2p_override they either follow the original behaviour, or just
        set the private flag and call set_phys_to_machine
      - gnttab_[un]map_refs are now wrappers that call __gnttab_[un]map_refs
        with m2p_override false (see the sketch after this entry)
      - a new function gnttab_[un]map_refs_userspace provides the old behaviour

      It also removes a stray space from page.h and changes ret to 0 in the
      XENFEAT_auto_translated_physmap case, as that is the only possible return
      value there.
      
      v2:
      - move the storing of the old mfn in page->index to gnttab_map_refs
      - move the function header update to a separate patch
      
      v3:
      - a new approach to retain old behaviour where it needed
      - squash the patches into one
      
      v4:
      - move out the common bits from m2p* functions, and pass pfn/mfn as parameter
      - clear page->private before doing anything with the page, so m2p_find_override
        won't race with this
      
      v5:
      - change return value handling in __gnttab_[un]map_refs
      - remove a stray space in page.h
      - add detail why ret = 0 now at some places
      
      v6:
      - don't pass pfn to m2p* functions, just get it locally
      Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
      Suggested-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      08ece5bb
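      The wrapper split described in the first bullets might be sketched like
      this (signatures abbreviated):

        int gnttab_map_refs(struct gnttab_map_grant_ref *ops,
                            struct page **pages, unsigned int count)
        {
                return __gnttab_map_refs(ops, pages, count, false); /* no m2p_override */
        }

        int gnttab_map_refs_userspace(struct gnttab_map_grant_ref *ops,
                                      struct page **pages, unsigned int count)
        {
                return __gnttab_map_refs(ops, pages, count, true);  /* gntdev only */
        }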
    • mm: sl[uo]b: fix misleading comments · 433a91ff
      By Dave Hansen
      On x86, SLUB creates and handles <=8192-byte allocations internally.
      It passes larger ones up to the allocator.  Saying "up to order 2" is,
      at best, ambiguous.  Is that order-1?  Or (order-2 bytes)?  Make
      it more clear.
      
      SLOB commits a similar sin.  It *handles* page-size requests, but the
      comment says that it passes up "all page size and larger requests".
      
      SLOB also swaps around the order of the very-similarly-named
      KMALLOC_SHIFT_HIGH and KMALLOC_SHIFT_MAX #defines.  Make it
      consistent with the order of the other two allocators.
      
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      433a91ff
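      The ambiguity is easy to see with 4 KiB pages (a worked example, not
      from the commit itself):

        /* An order-n allocation is 2^n contiguous pages, so:              */
        /*   order-1: 2 * 4096 = 8192  bytes  <- what SLUB handles itself  */
        /*   order-2: 4 * 4096 = 16384 bytes  <- one plausible misreading  */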
    • zsmalloc: add copyright · 31fc00bb
      By Minchan Kim
      Add my copyright to the zsmalloc source code which I maintain.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31fc00bb
    • zsmalloc: move it under mm · bcf1647d
      By Minchan Kim
      This patch moves zsmalloc under the mm directory.

      Before that, this description explains why we have needed a custom
      allocator.

      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and a high allocation
      success rate on large objects, but for allocations <= PAGE_SIZE.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for higher allocation success rate under memory
      pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the zspage.
      This allows for lower fragmentation than could be had with the kernel
      slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.  With the
      kernel slab allocator, if a page compresses to 60% of its original size,
      the memory savings gained through compression are lost to fragmentation
      because another object of the same size can't be stored in the leftover
      space.

      This ability to span pages results in zsmalloc allocations not being
      directly addressable by the user.  The user is given a
      non-dereferenceable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer to
      the mapped region that can be used.  The mapping is necessary since the
      object data may reside in two different noncontiguous pages.

      zsmalloc fulfills the allocation needs of zram perfectly.
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcf1647d
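      The handle-based access described above, as a usage sketch (API roughly
      as it existed at the time of this move; error handling elided):

        unsigned long handle;
        void *dst;

        handle = zs_malloc(pool, len);                /* opaque, not a pointer */
        dst = zs_map_object(pool, handle, ZS_MM_WO);  /* may stitch two pages  */
        memcpy(dst, compressed_data, len);
        zs_unmap_object(pool, handle);                /* unmap promptly        */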
    • kernel: use lockless list for smp_call_function_single · 6897fc22
      By Christoph Hellwig
      Make smp_call_function_single and friends more efficient by using a
      lockless list.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6897fc22
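      The lockless-list pattern referred to, sketched with the kernel's llist
      primitives (queue and field names simplified):

        /* Producer (any CPU): push an entry without taking a lock. */
        if (llist_add(&csd->llist, &dst_queue))
                send_ipi(cpu);            /* list was empty: kick the CPU */

        /* Consumer (IPI handler): detach the whole list in one atomic op. */
        struct llist_node *entry = llist_del_all(&dst_queue);
        llist_for_each_entry_safe(csd, next, entry, llist)
                csd->func(csd->info);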
    • memblock, bootmem: restore goal for alloc_low · 07bacb38
      By Yinghai Lu
      Now we have memblock_virt_alloc_low() to replace the original bootmem
      API in swiotlb.

      But we should not use BOOTMEM_LOW_LIMIT for architectures that do not
      support CONFIG_NOBOOTMEM, as the old API takes 0.
      
      | #define alloc_bootmem_low(x) \
      |        __alloc_bootmem_low(x, SMP_CACHE_BYTES, 0)
      |#define alloc_bootmem_low_pages_nopanic(x) \
      |        __alloc_bootmem_low_nopanic(x, PAGE_SIZE, 0)
      
      and we have
       #define BOOTMEM_LOW_LIMIT __pa(MAX_DMA_ADDRESS)
      for CONFIG_NOBOOTMEM.
      
      Restore the goal to 0 to fix the ia64 crash that Tony found.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Reported-by: Tony Luck <tony.luck@gmail.com>
      Tested-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07bacb38
    • can: add destructor for self generated skbs · 0ae89beb
      By Oliver Hartkopp
      Self-generated skbuffs in net/can/bcm.c set an skb->sk reference but no
      explicit destructor, which is enforced since Linux 3.11 by commit
      376c7311 (net: add a temporary sanity check in skb_orphan()).

      This patch adds some helper functions to make sure that a destructor is
      properly defined when a sock reference is assigned to a CAN-related skb.
      To create an unshared skb owned by the original sock, a common helper
      function has been introduced to replace the open-coded functions that
      create CAN echo skbs (sketched after this entry).
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Tested-by: Andre Naujoks <nautsch2@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ae89beb
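      The helper pattern described, in sketch form (close in spirit to the
      helpers this patch adds, but simplified):

        static void can_skb_destructor(struct sk_buff *skb)
        {
                sock_put(skb->sk);              /* drop the ref taken below */
        }

        static inline void can_skb_set_owner(struct sk_buff *skb, struct sock *sk)
        {
                if (sk) {
                        sock_hold(sk);          /* keep the sock alive */
                        skb->destructor = can_skb_destructor;
                        skb->sk = sk;
                }
        }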
  10. 30 Jan 2014 · 6 commits
  11. 29 Jan 2014 · 7 commits
    • fsnotify: Do not return merged event from fsnotify_add_notify_event() · 83c0e1b4
      By Jan Kara
      The event returned from fsnotify_add_notify_event() cannot ever be used
      safely as the event may be freed by the time the function returns (after
      dropping notification_mutex). So change the prototype to just return
      whether the event was added or merged into some existing event.
      Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
      Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      83c0e1b4
    • Btrfs: add support for inode properties · 63541927
      By Filipe David Borba Manana
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      Then unmount the filesystem, mount it again and perform an 'ls -lha'
      like in the former test. This means every single file ends up with a
      property (xattr) associated with it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property and have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essentially because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way, as we
         do for dir index items (delayed-inode.c), which could help reduce
         the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      63541927
    • rwsem: add rwsem_is_contended · 4a444b1f
      By Josef Bacik
      Btrfs needs a simple way to know if it needs to let go of its read lock
      on an rwsem.  Introduce rwsem_is_contended() to check whether there are
      currently any waiters on the rwsem.  This is just a heuristic; it is
      meant to be light and not 100% accurate, and is called by somebody
      already holding the rwsem for either read or write (usage sketched after
      this entry).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      4a444b1f
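      A usage sketch of the intended pattern (lock name and loop body
      hypothetical; the caller already holds the rwsem for read):

        down_read(&sem);
        while (more_work()) {
                process_one_batch();
                if (rwsem_is_contended(&sem)) {
                        up_read(&sem);          /* let the waiters in */
                        cond_resched();
                        down_read(&sem);
                }
        }
        up_read(&sem);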
    • Btrfs/tracepoint: update new flags for ordered extent TP · 792ddef0
      By Liu Bo
      Flag BTRFS_ORDERED_TRUNCATED is new; update the tracepoint to support
      it.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      792ddef0
    • Btrfs/tracepoint: fix to report right flags for ordered extent · 9d04a8ce
      By Liu Bo
      We use set_bit() to assign the ordered extent's flags, but the related
      tracepoint doesn't do the same thing, so the trace output doesn't parse
      the flags correctly.

      Also, since the flags are bit positions, switch to __print_flags with a
      'delim' instead of __print_symbolic (sketched after this entry).
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9d04a8ce
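      The two fixes combine to something like this in the TP_printk (a sketch
      showing only a few flags): the stored value holds bits set via
      set_bit(), so each symbol is shifted before masking, and __print_flags
      joins multiple set bits with the delimiter.

        __print_flags(__entry->flags, "|",
                { (1UL << BTRFS_ORDERED_IO_DONE),   "IO_DONE"   },
                { (1UL << BTRFS_ORDERED_COMPLETE),  "COMPLETE"  },
                { (1UL << BTRFS_ORDERED_TRUNCATED), "TRUNCATED" })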
    • btrfs: add ioctl to export size of global metadata reservation · 01e219e8
      By Jeff Mahoney
      btrfs filesystem df output will show the size of the metadata space
      and how much of it is used, and the user assumes that the difference
      is all usable space. Since that's not actually the case due to the
      global metadata reservation, we should provide the full picture to the
      user.
      
      This patch adds an ioctl that exports the size of the global metadata
      reservation so that btrfs filesystem df can report it.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      01e219e8
    • btrfs: add ioctls to query/change feature bits online · 2eaa055f
      By Jeff Mahoney
      There are some feature bits that require no offline setup and can
      be enabled online. I've only reviewed extended irefs, but there will
      probably be more.
      
      We introduce three new ioctls:
      - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features.
      - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs
        basis, as well as querying which features are changeable while mounted.
      - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis.
      
      We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that
      allow us to define which features are safe to change at runtime.
      
      The failure modes for BTRFS_IOC_SET_FEATURES are as follows:
      - Enabling a completely unsupported feature: warns and returns -ENOTSUPP
      - Enabling a feature that can only be done offline: warns and returns -EPERM
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      2eaa055f
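      The failure modes map naturally onto the two masks (a sketch of the
      check, with assumed mask names, not the exact btrfs code):

        u64 unsupported = requested & ~BTRFS_FEATURE_INCOMPAT_SUPP;
        if (unsupported)
                return -ENOTSUPP;       /* feature unknown to this kernel */

        u64 offline_only = changed & ~BTRFS_FEATURE_INCOMPAT_SAFE_SET;
        if (offline_only)
                return -EPERM;          /* supported, but offline-only */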