1. 11 January 2012 (9 commits)
    • mm: try to distribute dirty pages fairly across zones · a756cf59
      Authored by Johannes Weiner
      The maximum number of dirty pages that exist in the system at any time is
      determined by a number of pages considered dirtyable and a user-configured
      percentage of those, or an absolute number in bytes.
      
      This number of dirtyable pages is the sum of memory provided by all the
      zones in the system minus their lowmem reserves and high watermarks, so
      that the system can retain a healthy number of free pages without having
      to reclaim dirty pages.
      
      But there is a flaw in that we have a zoned page allocator which does not
      care about the global state but rather the state of individual memory
      zones.  And right now there is nothing that prevents one zone from filling
      up with dirty pages while other zones are spared, which frequently leads
      to situations where kswapd, in order to restore the watermark of free
      pages, does indeed have to write pages from that zone's LRU list.  This
      can interfere so badly with IO from the flusher threads that major
      filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
      already, taking away the VM's only possibility to keep such a zone
      balanced, aside from hoping the flushers will soon clean pages from that
      zone.
      
      Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
      the global limit is to the global amount of dirtyable memory, and try to
      make sure that no single zone receives more than its fair share of the
      globally allowed dirty pages in the first place.  As the number of pages
      considered dirtyable excludes the zones' lowmem reserves and high
      watermarks, the maximum number of dirty pages in a zone is such that the
      zone can always be balanced without requiring page cleaning.
      
      As this is a placement decision in the page allocator and pages are
      dirtied only after the allocation, this patch allows allocators to pass
      __GFP_WRITE when they know in advance that the page will be written to and
      become dirty soon.  The page allocator will then attempt to allocate from
      the first zone of the zonelist - which on NUMA is determined by the task's
      NUMA memory policy - that has not exceeded its dirty limit.
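
      A hedged illustration of the hint (the flag name __GFP_WRITE is the one this
      patch introduces; the helper function here is made up for the example): a
      filesystem path that allocates a page cache page it is about to dirty can OR
      the flag into its allocation mask so that zones already at their dirty limit
      are skipped.

      /* Sketch only: hypothetical helper, not code from this patch. */
      static struct page *alloc_page_for_buffered_write(struct address_space *mapping)
      {
              /* the page will be dirtied right after allocation */
              gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_WRITE;

              /* zones over their dirty limit are skipped for this allocation */
              return __page_cache_alloc(gfp);
      }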
      
      At first glance, it would appear that the diversion to lower zones can
      increase pressure on them, but this is not the case.  With a full high
      zone, allocations will be diverted to lower zones eventually, so it is
      more of a shift in timing of the lower zone allocations.  Workloads that
      previously could fit their dirty pages completely in the higher zone may
      be forced to allocate from lower zones, but the amount of pages that
      "spill over" are limited themselves by the lower zones' dirty constraints,
      and thus unlikely to become a problem.
      
      For now, the problem of unfair dirty page distribution remains for NUMA
      configurations where the zones allowed for allocation are in sum not big
      enough to trigger the global dirty limits, wake up the flusher threads and
      remedy the situation.  Because of this, an allocation that could not
      succeed on any of the considered zones is allowed to ignore the dirty
      limits before going into direct reclaim or even failing the allocation,
      until a future patch changes the global dirty throttling and flusher
      thread activation so that they take individual zone states into account.
      
      			Test results
      
      15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
      40% dirty ratio
      16G USB thumb drive
      10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
      
      		seconds			nr_vmscan_write
      		        (stddev)	       min|     median|        max
      xfs
      vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
      patched:	 550.996( 3.802)	     0.000|      0.000|      0.000
      
      fuse-ntfs
      vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
      patched:	 558.049(17.914)	     0.000|      0.000|     43.000
      
      btrfs
      vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
      patched:	 563.365(11.368)	     0.000|      0.000|   1362.000
      
      ext4
      vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
      patched:	 568.806(17.496)	     0.000|      0.000|      0.000
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: exclude reserved pages from dirtyable memory · ab8fabd4
      Authored by Johannes Weiner
      Per-zone dirty limits try to distribute page cache pages allocated for
      writing across zones in proportion to the individual zone sizes, to reduce
      the likelihood of reclaim having to write back individual pages from the
      LRU lists in order to make progress.
      
      This patch:
      
      The amount of dirtyable pages should not include the full number of free
      pages: there is a number of reserved pages that the page allocator and
      kswapd always try to keep free.
      
      The closer (reclaimable pages - dirty pages) is to the number of reserved
      pages, the more likely it becomes for reclaim to run into dirty pages:
      
             +----------+ ---
             |   anon   |  |
             +----------+  |
             |          |  |
             |          |  -- dirty limit new    -- flusher new
             |   file   |  |                     |
             |          |  |                     |
             |          |  -- dirty limit old    -- flusher old
             |          |                        |
             +----------+                       --- reclaim
             | reserved |
             +----------+
             |  kernel  |
             +----------+
      
      This patch introduces a per-zone dirty reserve that takes both the lowmem
      reserve as well as the high watermark of the zone into account, and a
      global sum of those per-zone values that is subtracted from the global
      amount of dirtyable pages.  The lowmem reserve is unavailable to page
      cache allocations and kswapd tries to keep the high watermark free.  We
      don't want to end up in a situation where reclaim has to clean pages in
      order to balance zones.
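
      A minimal sketch of the accounting described above (the function name and the
      exact formula are an illustration of the idea, not necessarily the code the
      patch adds): each zone sets aside its high watermark plus its largest lowmem
      reserve, and the sum of these per-zone reserves is subtracted from the global
      dirtyable total.

      /* Conceptual sketch; details differ in the real mm code. */
      static unsigned long zone_dirty_reserve(struct zone *zone)
      {
              unsigned long max_lowmem_reserve = 0;
              int i;

              /* largest amount lower allocations must leave free in this zone */
              for (i = 0; i < MAX_NR_ZONES; i++)
                      max_lowmem_reserve = max(max_lowmem_reserve,
                                               zone->lowmem_reserve[i]);

              /* unavailable to page cache, plus what kswapd keeps free */
              return max_lowmem_reserve + high_wmark_pages(zone);
      }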
      
      Not treating reserved pages as dirtyable on a global level is only a
      conceptual fix.  In reality, dirty pages are not distributed equally
      across zones and reclaim runs into dirty pages on a regular basis.
      
      But it is important to get this right before tackling the problem on a
      per-zone level, where the distance between reclaim and the dirty pages is
      mostly much smaller in absolute numbers.
      
      [akpm@linux-foundation.org: fix highmem build]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, debug: test for online nid when allocating on single node · f6d7e0cb
      Authored by David Rientjes
      Calling alloc_pages_exact_node() means the allocation only passes the
      zonelist of a single node into the page allocator.  If that node isn't
      online, its zonelist may never have been initialized, causing a strange
      oops that may not immediately be clear.
      
      I recently debugged an issue where node 0 wasn't online and an allocator
      was passing 0 to alloc_pages_exact_node(), and it resulted in a NULL
      pointer dereference on zonelist->_zonerefs.  If CONFIG_DEBUG_VM is enabled, though, it
      would be nice to catch this a bit earlier.
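
      Under CONFIG_DEBUG_VM, a check along the following lines (a sketch of the
      idea, not claimed to be the exact hunk) turns the bad nid into an immediate,
      readable assertion instead of a NULL dereference deep inside the allocator:

      static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                        unsigned int order)
      {
              /* catch callers passing a node that was never brought online */
              VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));

              return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
      }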
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: more intensive memory corruption debugging · c0a32fc5
      Authored by Stanislaw Gruszka
      With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
      on any access (read or write) to an unallocated page, which permits us to
      catch code that corrupts memory.  However, the kernel tries to maximise
      memory usage, so there are usually few free pages in the system and buggy
      code usually ends up corrupting some crucial data.
      
      This patch changes the buddy allocator to keep more free/protected pages
      and to interlace free/protected and allocated pages to increase the
      probability of catching corruption.
      
      When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
      debug_guardpage_minorder defines the minimum order used by the page
      allocator to grant a request.  The requested size will be returned with
      the remaining pages used as guard pages.
      
      The default value of debug_guardpage_minorder is zero: no change from
      current behaviour.
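
      A rough sketch of the order calculation described above (illustrative only;
      the helper name is invented): with debug_guardpage_minorder=N on the kernel
      command line, the allocator never splits below order N, the caller gets the
      pages it asked for, and the surplus pages in the block become guard pages
      that trap on any access.

      /* Illustrative helper, not code from the patch. */
      static unsigned int effective_alloc_order(unsigned int requested_order,
                                                unsigned int guardpage_minorder)
      {
              /*
               * 2^order pages leave the buddy lists; 2^requested_order go to
               * the caller, the remainder are marked as guard pages.
               */
              return max(requested_order, guardpage_minorder);
      }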
      
      [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel.h: add BUILD_BUG() macro · 1399ff86
      Authored by David Daney
      We can place this in definitions that we expect the compiler to remove by
      dead code elimination.  If this assertion fails, we get a nice error
      message at build time.
      
      The GCC function attribute error("message") was added in version 4.3, so
      we define a new macro __linktime_error(message) to expand to this for
      GCC-4.3 and later.  This will give us an error diagnostic from the
      compiler on the line that fails.  For other compilers
      __linktime_error(message) expands to nothing, and we have to be content
      with a link time error, but at least we will still get a build error.
      
      BUILD_BUG() expands to a call to the undefined function __build_bug_failed()
      and will fail at link time if the compiler ever emits code for it.  On
      GCC 4.3 and later, the error() attribute is used so that the failure is
      reported at compile time instead.
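
      A sketch of the mechanism as described (close to, but not guaranteed to match,
      the exact macro added to kernel.h), followed by a typical use in a branch the
      compiler is expected to eliminate:

      #define BUILD_BUG()                                                     \
              do {                                                            \
                      extern void __build_bug_failed(void)                    \
                              __linktime_error("BUILD_BUG failed");           \
                      __build_bug_failed();                                   \
              } while (0)

      /* Only valid on 64-bit builds; elsewhere the build fails loudly. */
      static inline unsigned long load_word(const void *p)
      {
              if (sizeof(unsigned long) == 8)
                      return *(const unsigned long *)p;
              BUILD_BUG();    /* dead code on 64-bit, so it compiles away */
              return 0;
      }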
      Signed-off-by: David Daney <david.daney@cavium.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: DM <dm.n9107@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid livelock on !__GFP_FS allocations · f90ac398
      Authored by Mel Gorman
      Colin Cross reported:
      
        Under the following conditions, __alloc_pages_slowpath can loop forever:
        gfp_mask & __GFP_WAIT is true
        gfp_mask & __GFP_FS is false
        reclaim and compaction make no progress
        order <= PAGE_ALLOC_COSTLY_ORDER
      
        These conditions happen very often during suspend and resume,
        when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
        allocations into __GFP_WAIT.
      
        The oom killer is not run because gfp_mask & __GFP_FS is false,
        but should_alloc_retry will always return true when order is less
        than PAGE_ALLOC_COSTLY_ORDER.
      
      In his fix, he avoided retrying the allocation if reclaim made no progress
      and __GFP_FS was not set.  The problem is that this would cause GFP_NOIO
      allocations to fail where they previously succeeded, which would be very
      unfortunate.
      
      The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
      behave like GFP_NOIO is that normally flushers will be cleaning pages and
      kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
      The same does not necessarily apply during suspend as the storage device
      may be suspended.
      
      This patch special cases the suspend case to fail the page allocation if
      reclaim cannot make progress and adds some documentation on how
      gfp_allowed_mask is currently used.  Failing allocations like this may
      cause suspend to abort but that is better than a livelock.
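
      A hedged sketch of the shape of the fix (the helper and call-site names are
      assumptions based on the description above, not a copy of the patch): when
      reclaim made no progress and storage may already be suspended, the retry loop
      gives up instead of spinning forever.

      /* Sketch only; the real retry logic in mm/page_alloc.c has more cases. */
      static bool should_alloc_retry(gfp_t gfp_mask, unsigned int order,
                                     unsigned long did_some_progress)
      {
              /*
               * During suspend the flushers and the storage device may be
               * frozen, so no forward progress can be expected: fail the
               * allocation rather than livelock.
               */
              if (!did_some_progress && pm_suspended_storage())
                      return false;

              /* otherwise keep retrying cheap (low-order) allocations */
              return order <= PAGE_ALLOC_COSTLY_ORDER;
      }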
      
      [mgorman@suse.de: Rework fix to be suspend specific]
      [rientjes@google.com: Move suspended device check to should_alloc_retry]
      Reported-by: Colin Cross <ccross@android.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unused pagevec_free · da066ad3
      Authored by Konstantin Khlebnikov
      It is not exported and nobody uses it any more.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add free_hot_cold_page_list() helper · cc59850e
      Authored by Konstantin Khlebnikov
      This patch adds a helper, free_hot_cold_page_list(), to free a list of
      0-order pages.  It frees the pages directly from the list, without a
      temporary pagevec.  It also calls trace_mm_pagevec_free() to simulate
      pagevec_free() behaviour.
      
      bloat-o-meter:
      
      add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
      function                                     old     new   delta
      free_hot_cold_page_list                        -     264    +264
      get_page_from_freelist                      2129    2132      +3
      __pagevec_free                               243     239      -4
      split_free_page                              380     373      -7
      release_pages                                606     510     -96
      free_page_list                               188       -    -188
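
      The helper itself is small; a sketch consistent with the description above
      (close to the real mm/page_alloc.c code, though not copied verbatim):

      void free_hot_cold_page_list(struct list_head *list, int cold)
      {
              struct page *page, *next;

              list_for_each_entry_safe(page, next, list, lru) {
                      trace_mm_pagevec_free(page, cold);
                      free_hot_cold_page(page, cold);
              }
      }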
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback.c: make determine_dirtyable_memory static again · 1edf2234
      Authored by Johannes Weiner
      The tracing ring-buffer used this function briefly, but not anymore.
      Make it local to the writeback code again.
      
      Also, move the function so that no forward declaration needs to be
      reintroduced.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 10 January 2012 (4 commits)
  3. 09 January 2012 (1 commit)
    • jbd: Remove j_barrier mutex · 00482785
      Authored by Jan Kara
      The j_barrier mutex is used for serializing journal lock operations.  The
      problem with it is that e.g. the FIFREEZE ioctl results in a process leaving
      the kernel with the j_barrier mutex held, which makes lockdep complain.  The
      hibernation code also wants to freeze the filesystem, but then cannot
      hibernate the system because the mutex is still held.
      
      So we remove the j_barrier mutex and wait directly on j_barrier_count
      instead.  Since locking the journal is a rare operation, we don't have to
      care about fairness.
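
      A generic sketch of the pattern (names invented here, this is not the jbd
      code): later lockers sleep on a wait queue until the count drops back to
      zero, and nothing returns to userspace holding a mutex.

      static unsigned int frozen;             /* stand-in for j_barrier_count */
      static DEFINE_SPINLOCK(freeze_lock);
      static DECLARE_WAIT_QUEUE_HEAD(freeze_wq);

      static void freeze_begin(void)
      {
              spin_lock(&freeze_lock);
              while (frozen) {                /* someone else froze first: wait */
                      spin_unlock(&freeze_lock);
                      wait_event(freeze_wq, frozen == 0);
                      spin_lock(&freeze_lock);
              }
              frozen = 1;
              spin_unlock(&freeze_lock);      /* no lock held across ioctl return */
      }

      static void freeze_end(void)
      {
              spin_lock(&freeze_lock);
              frozen = 0;
              spin_unlock(&freeze_lock);
              wake_up_all(&freeze_wq);
      }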
      
      CC: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
  4. 07 January 2012 (10 commits)
  5. 06 January 2012 (4 commits)
    • dma-buf: Introduce dma buffer sharing mechanism · d15bd7ee
      Authored by Sumit Semwal
      This is the first step in defining a dma buffer sharing mechanism.
      
      A new buffer object dma_buf is added, with operations and API to allow easy
      sharing of this buffer object across devices.
      
      The framework allows:
      - creation of a buffer object, its association with a file pointer, and
         associated allocator-defined operations on that buffer. This operation is
         called the 'export' operation.
      - different devices to 'attach' themselves to this exported buffer object, to
        facilitate backing storage negotiation, using dma_buf_attach() API.
      - the exported buffer object to be shared with the other entity by asking for
         its 'file-descriptor (fd)', and sharing the fd across.
      - a received fd to get the buffer object back, where it can be accessed using
         the associated exporter-defined operations.
      - the exporter and user to share the scatterlist associated with this buffer
         object using map_dma_buf and unmap_dma_buf operations.
      
      At least one 'attach()' call is required to be made prior to calling the
      map_dma_buf() operation.
      
      A couple of building blocks in map_dma_buf() are added to ease the
      introduction of syncing across exporter and users, and of late allocation
      by the exporter.
      
      For this first version, this framework will work with certain conditions:
      - *ONLY* exporter will be allowed to mmap to userspace (outside of this
         framework - mmap is not a buffer object operation),
      - currently, *ONLY* users that do not need CPU access to the buffer are
         allowed.
      
      More details are there in the documentation patch.
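
      A hedged sketch of the importer-side flow described above (error paths
      trimmed; consult the documentation patch for the authoritative signatures):

      /* Turn a received fd back into something a device can DMA to/from. */
      static struct sg_table *import_shared_buffer(int fd, struct device *dev,
                                                   struct dma_buf_attachment **att)
      {
              struct dma_buf *dmabuf;

              dmabuf = dma_buf_get(fd);               /* fd -> buffer object */
              if (IS_ERR(dmabuf))
                      return ERR_CAST(dmabuf);

              *att = dma_buf_attach(dmabuf, dev);     /* backing negotiation */
              if (IS_ERR(*att)) {
                      dma_buf_put(dmabuf);
                      return ERR_CAST(*att);
              }

              /* at least one attach() must precede the map operation */
              return dma_buf_map_attachment(*att, DMA_BIDIRECTIONAL);
      }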
      
      This is based on design suggestions from many people at the mini-summits[1],
      most notably from Arnd Bergmann <arnd@arndb.de>, Rob Clark <rob@ti.com> and
      Daniel Vetter <daniel@ffwll.ch>.
      
      The implementation is inspired from proof-of-concept patch-set from
      Tomasz Stanislawski <t.stanislaws@samsung.com>, who demonstrated buffer sharing
      between two v4l2 devices. [2]
      
      [1]: https://wiki.linaro.org/OfficeofCTO/MemoryManagement
      [2]: http://lwn.net/Articles/454389
      Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: Sumit Semwal <sumit.semwal@ti.com>
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: Dave Airlie <airlied@redhat.com>
      Reviewed-and-Tested-by: Rob Clark <rob.clark@linaro.org>
      Signed-off-by: Dave Airlie <airlied@redhat.com>
    • net: pack skb_shared_info more efficiently · 9f42f126
      Authored by Ian Campbell
      nr_frags can be 8 bits since 256 is plenty of fragments. This allows it to be
      packed with tx_flags.
      
      Also, by moving ip6_frag_id and dataref (both 4 bytes) next to each other,
      we can avoid a hole between ip6_frag_id and frag_list on 64-bit systems.
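
      The saving is purely an alignment effect. A self-contained illustration of
      the reordering (field types simplified; this is not the real skb_shared_info
      layout):

      struct before {                         /* 24 bytes on a 64-bit ABI      */
              unsigned int  ip6_frag_id;      /* 4 bytes + 4 bytes of padding  */
              void         *frag_list;        /* 8 bytes, needs 8-byte align   */
              unsigned int  dataref;          /* 4 bytes + 4 bytes of padding  */
      };

      struct after {                          /* 16 bytes                      */
              unsigned int  ip6_frag_id;      /* the two 4-byte fields now     */
              unsigned int  dataref;          /* share a single 8-byte word    */
              void         *frag_list;
      };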
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sfq: extend limits · 18cb8098
      Authored by Eric Dumazet
      SFQ as implemented in Linux is very limited, with at most 127 flows
      and a limit of 127 packets.  [ So if 127 flows are active, we have one
      packet per flow ]
      
      This patch brings the following features to SFQ to cope with modern needs.
      
      - Ability to specify a smaller per-flow limit of in-flight packets
        (default value being 127 packets)
      
      - Ability to have up to 65408 active flows (instead of 127)
      
      - Ability to have head drops instead of tail drops
        (to drop old packets from a flow)
      
      Example of use: no more than 20 packets per flow, max 8000 flows, max
      20000 packets in the SFQ qdisc, hash table of 65536 slots.
      
      tc qdisc add ... sfq \
              flows 8000 \
              depth 20 \
              headdrop \
              limit 20000 \
              divisor 65536
      
      RAM usage:
      
      2 bytes per hash table entry (instead of the previous 1 byte/entry)
      32 bytes per flow on 64-bit arches, instead of 384 for QFQ, giving a much
      better cache hit ratio.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdev: FCoE: Add new ndo_get_fcoe_hbainfo() call · 68bad94e
      Authored by Neerav Parikh
      This adds a new ndo_get_fcoe_hbainfo() call in
      net_device_ops for the FCoE protocol stack.
      
      If supported by the underlying device, the FCoE protocol
      stack will call this to get device-specific information
      from the underlying device.
      This information will then be used by the FCoE protocol
      stack to register Fibre Channel HBA attributes with the
      Fibre Channel Management Service via the Fabric Device
      Management Interface (FDMI), as per the T11 FC-GS
      specification.
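
      A hedged sketch of how the FCoE stack might use such a hook (the struct name
      and argument types here are assumptions; only the callback's existence is
      taken from this commit):

      static void fcoe_register_hba_attributes(struct net_device *netdev)
      {
              struct netdev_fcoe_hbainfo info;        /* name assumed */
              const struct net_device_ops *ops = netdev->netdev_ops;

              if (!ops->ndo_get_fcoe_hbainfo)         /* optional op */
                      return;

              if (ops->ndo_get_fcoe_hbainfo(netdev, &info))
                      return;

              /* pass the returned HBA attributes on to FDMI registration */
      }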
      
      Changes in v2:
      - As per comments from David Miller, aligned the parameters
      of ndo_get_fcoe_hbainfo()
      Signed-off-by: Neerav Parikh <Neerav.Parikh@intel.com>
      Tested-by: Ross Brattain <ross.b.brattain@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 05 January 2012 (6 commits)
  7. 04 January 2012 (6 commits)