1. 03 10月, 2011 7 次提交
    • W
      writeback: stabilize bdi->dirty_ratelimit · 7381131c
      Wu Fengguang 提交于
      There are some imperfections in balanced_dirty_ratelimit.
      
      1) large fluctuations
      
      The dirty_rate used for computing balanced_dirty_ratelimit is merely
      averaged in the past 200ms (very small comparing to the 3s estimation
      period for write_bw), which makes rather dispersed distribution of
      balanced_dirty_ratelimit.
      
      It's pretty hard to average out the singular points by increasing the
      estimation period. Considering that the averaging technique will
      introduce very undesirable time lags, I give it up totally. (btw, the 3s
      write_bw averaging time lag is much more acceptable because its impact
      is one-way and therefore won't lead to oscillations.)
      
      The more practical way is filtering -- most singular
      balanced_dirty_ratelimit points can be filtered out by remembering some
      prev_balanced_rate and prev_prev_balanced_rate. However the more
      reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.
      
      2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
      match could become unbalanced, which may lead to large systematical
      errors in balanced_dirty_ratelimit. The truncates, due to its possibly
      bumpy nature, can hardly be compensated smoothly. So let's face it. When
      some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
      high, dirty pages will go higher than the setpoint. task_ratelimit will
      in turn become lower than dirty_ratelimit.  So if we consider both
      balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
      only when they are on the same side of dirty_ratelimit, the systematical
      errors in balanced_dirty_ratelimit won't be able to bring
      dirty_ratelimit far away.
      
      The balanced_dirty_ratelimit estimation may also be inaccurate near
      @limit or @freerun, however is less an issue.
      
      3) since we ultimately want to
      
      - keep the fluctuations of task ratelimit as small as possible
      - keep the dirty pages around the setpoint as long time as possible
      
      the update policy used for (2) also serves the above goals nicely:
      if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
      and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
      there is no point to bring up dirty_ratelimit in a hurry only to hurt
      both the above two goals.
      
      So, we make use of task_ratelimit to limit the update of dirty_ratelimit
      in two ways:
      
      1) avoid changing dirty rate when it's against the position control target
         (the adjusted rate will slow down the progress of dirty pages going
         back to setpoint).
      
      2) limit the step size. task_ratelimit is changing values step by step,
         leaving a consistent trace comparing to the randomly jumping
         balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
         errors in stable state and typically larger errors when there are big
         errors in rate.  So it's a pretty good limiting factor for the step
         size of dirty_ratelimit.
      
      Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
      task_ratelimit is merely used as a limiting factor.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      7381131c
    • W
      writeback: dirty rate control · be3ffa27
      Wu Fengguang 提交于
      It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
      when there are N dd tasks.
      
      On write() syscall, use bdi->dirty_ratelimit
      ============================================
      
          balance_dirty_pages(pages_dirtied)
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              pause = pages_dirtied / task_ratelimit;
              sleep(pause);
          }
      
      On every 200ms, update bdi->dirty_ratelimit
      ===========================================
      
          bdi_update_dirty_ratelimit()
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
              bdi->dirty_ratelimit = balanced_dirty_ratelimit
          }
      
      Estimation of balanced bdi->dirty_ratelimit
      ===========================================
      
      balanced task_ratelimit
      -----------------------
      
      balance_dirty_pages() needs to throttle tasks dirtying pages such that
      the total amount of dirty pages stays below the specified dirty limit in
      order to avoid memory deadlocks. Furthermore we desire fairness in that
      tasks get throttled proportionally to the amount of pages they dirty.
      
      IOW we want to throttle tasks such that we match the dirty rate to the
      writeout bandwidth, this yields a stable amount of dirty pages:
      
              dirty_rate == write_bw                                          (1)
      
      The fairness requirement gives us:
      
              task_ratelimit = balanced_dirty_ratelimit
                             == write_bw / N                                  (2)
      
      where N is the number of dd tasks.  We don't know N beforehand, but
      still can estimate balanced_dirty_ratelimit within 200ms.
      
      Start by throttling each dd task at rate
      
              task_ratelimit = task_ratelimit_0                               (3)
                               (any non-zero initial value is OK)
      
      After 200ms, we measured
      
              dirty_rate = # of pages dirtied by all dd's / 200ms
              write_bw   = # of pages written to the disk / 200ms
      
      For the aggressive dd dirtiers, the equality holds
      
              dirty_rate == N * task_rate
                         == N * task_ratelimit_0                              (4)
      Or
              task_ratelimit_0 == dirty_rate / N                              (5)
      
      Now we conclude that the balanced task ratelimit can be estimated by
      
                                                            write_bw
              balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                            dirty_rate
      
      Because with (4) and (5) we can get the desired equality (1):
      
                                                             write_bw
              balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                             dirty_rate
                                       == write_bw / N
      
      Then using the balanced task ratelimit we can compute task pause times like:
      
              task_pause = task->nr_dirtied / task_ratelimit
      
      task_ratelimit with position control
      ------------------------------------
      
      However, while the above gives us means of matching the dirty rate to
      the writeout bandwidth, it at best provides us with a stable dirty page
      count (assuming a static system). In order to control the dirty page
      count such that it is high enough to provide performance, but does not
      exceed the specified limit we need another control.
      
      The dirty position control works by extending (2) to
      
              task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)
      
      where pos_ratio is a negative feedback function that subjects to
      
      1) f(setpoint) = 1.0
      2) df/dx < 0
      
      That is, if the dirty pages are ABOVE the setpoint, we throttle each
      task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty
      pages are created less fast than they are cleaned, thus DROP to the
      setpoints (and the reverse).
      
      Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
      remains CONSTANT for the past 200ms, we get
      
              task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)
      
      Putting (8) into (6), we get the formula used in
      bdi_update_dirty_ratelimit():
      
                                                      write_bw
              balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                      dirty_rate
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      be3ffa27
    • W
      writeback: add bg_threshold parameter to __bdi_update_bandwidth() · af6a3113
      Wu Fengguang 提交于
      No behavior change.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      af6a3113
    • W
      writeback: dirty position control · 6c14ae1e
      Wu Fengguang 提交于
      bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
      that the resulted task rate limit can drive the dirty pages back to the
      global/bdi setpoints.
      
      Old scheme is,
                                                |
                                 free run area  |  throttle area
        ----------------------------------------+---------------------------->
                                          thresh^                  dirty pages
      
      New scheme is,
      
        ^ task rate limit
        |
        |            *
        |             *
        |              *
        |[free run]      *      [smooth throttled]
        |                  *
        |                     *
        |                         *
        ..bdi->dirty_ratelimit..........*
        |                               .     *
        |                               .          *
        |                               .              *
        |                               .                 *
        |                               .                    *
        +-------------------------------.-----------------------*------------>
                                setpoint^                  limit^  dirty pages
      
      The slope of the bdi control line should be
      
      1) large enough to pull the dirty pages to setpoint reasonably fast
      
      2) small enough to avoid big fluctuations in the resulted pos_ratio and
         hence task ratelimit
      
      Since the fluctuation range of the bdi dirty pages is typically observed
      to be within 1-second worth of data, the bdi control line's slope is
      selected to be a linear function of bdi write bandwidth, so that it can
      adapt to slow/fast storage devices well.
      
      Assume the bdi control line
      
      	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
      
      where k is the negative slope.
      
      If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
      are fluctuating in range
      
      	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
      
      we get slope
      
      	k = - 1 / (8 * write_bw)
      
      Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
      
      	x_intercept = bdi_setpoint + 8 * write_bw
      
      The global/bdi slopes are nicely complementing each other when the
      system has only one major bdi (indicated by bdi_thresh ~= thresh):
      
      1) slope of global control line    => scaling to the control scope size
      2) slope of main bdi control line  => scaling to the writeout bandwidth
      
      so that
      
      - in memory tight systems, (1) becomes strong enough to squeeze dirty
        pages inside the control scope
      
      - in large memory systems where the "gravity" of (1) for pulling the
        dirty pages to setpoint is too weak, (2) can back (1) up and drive
        dirty pages to bdi_setpoint ~= setpoint reasonably fast.
      
      Unfortunately in JBOD setups, the fluctuation range of bdi threshold
      is related to memory size due to the interferences between disks.  In
      this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
      
      Given equations
      
              span = x_intercept - bdi_setpoint
              k = df/dx = - 1 / span
      
      and the extremum values
      
              span = bdi_thresh
              dx = bdi_thresh
      
      we get
      
              df = - dx / span = - 1.0
      
      That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
      task ratelimit will fluctuate by -100%.
      
      peter: use 3rd order polynomial for the global control line
      
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      6c14ae1e
    • W
      writeback: account per-bdi accumulated dirtied pages · c8e28ce0
      Wu Fengguang 提交于
      Introduce the BDI_DIRTIED counter. It will be used for estimating the
      bdi's dirty bandwidth.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Michael Rubin <mrubin@google.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      c8e28ce0
    • L
      Merge branch 'for-linus' of git://git.infradead.org/users/sameo/mfd-2.6 · 9b137769
      Linus Torvalds 提交于
      * 'for-linus' of git://git.infradead.org/users/sameo/mfd-2.6:
        mfd: Fix generic irq chip ack function name for jz4740-adc
      9b137769
    • L
      Merge branch 'for-linus' of git://github.com/tiwai/sound · 4edf5886
      Linus Torvalds 提交于
      * 'for-linus' of git://github.com/tiwai/sound:
        ALSA: hda - Fix a regression of the position-buffer check
      4edf5886
  2. 02 10月, 2011 1 次提交
  3. 01 10月, 2011 2 次提交
  4. 30 9月, 2011 12 次提交
    • P
      posix-cpu-timers: Cure SMP wobbles · d670ec13
      Peter Zijlstra 提交于
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$ 
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      
      Aside of that it makes the function safe on 32 bit systems. The old
      code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
      64bit value and could be changed on another cpu at the same time.
      Reported-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twinsTested-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      d670ec13
    • T
      ALSA: hda - Fix a regression of the position-buffer check · 798cb7e8
      Takashi Iwai 提交于
      The commit a810364a
          ALSA: hda - Handle -1 as invalid position, too
      caused a regression on some machines that require the position-buffer
      instead of LPIB, e.g. resulting in noises with mic recording with
      PulseAudio.
      
      This patch fixes the detection by delaying the test at the timing as
      same as 3.0, i.e. doing the position check only when requested in
      azx_position_ok().
      Reported-and-tested-by: NRocko Requin <rockorequin@hotmail.com>
      Signed-off-by: NTakashi Iwai <tiwai@suse.de>
      798cb7e8
    • R
      Resource: fix wrong resource window calculation · 47ea91b4
      Ram Pai 提交于
      __find_resource() incorrectly returns a resource window which overlaps
      an existing allocated window.  This happens when the parent's
      resource-window spans 0x00000000 to 0xffffffff and is entirely allocated
      to all its children resource-windows.
      
      __find_resource() looks for gaps in resource allocation among the
      children resource windows.  When it encounters the last child window it
      blindly tries the range next to one allocated to the last child.  Since
      the last child's window ends at 0xffffffff the calculation overflows,
      leading the algorithm to believe that any window in the range 0x0000000
      to 0xfffffff is available for allocation.  This leads to a conflicting
      window allocation.
      
      Michal Ludvig reported this issue seen on his platform.  The following
      patch fixes the problem and has been verified by Michal.  I believe this
      bug has been there for ages.  It got exposed by git commit 2bbc6942
      ("PCI : ability to relocate assigned pci-resources")
      Signed-off-by: NRam Pai <linuxram@us.ibm.com>
      Tested-by: NMichal Ludvig <mludvig@logix.net.nz>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      47ea91b4
    • L
      Merge branch 'for-linus' of git://github.com/NewDreamNetwork/ceph-client · 92bb062f
      Linus Torvalds 提交于
      * 'for-linus' of git://github.com/NewDreamNetwork/ceph-client:
        libceph: fix pg_temp mapping update
        libceph: fix pg_temp mapping calculation
        libceph: fix linger request requeuing
        libceph: fix parse options memory leak
        libceph: initialize ack_stamp to avoid unnecessary connection reset
      92bb062f
    • L
      Merge branch 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus · 7409b713
      Linus Torvalds 提交于
      * 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus:
        [media] omap3isp: Fix build error in ispccdc.c
        [media] uvcvideo: Fix crash when linking entities
        [media] v4l: Make sure we hold a reference to the v4l2_device before using it
        [media] v4l: Fix use-after-free case in v4l2_device_release
        [media] uvcvideo: Set alternate setting 0 on resume if the bus has been reset
        [media] OMAP_VOUT: Fix build break caused by update_mode removal in DSS2
      7409b713
    • L
      Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6 · 0ecdb12a
      Linus Torvalds 提交于
      * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
        [S390] cio: fix cio_tpi ignoring adapter interrupts
        [S390] gmap: always up mmap_sem properly
        [S390] Do not clobber personality flags on exec
      0ecdb12a
    • L
      Merge git://github.com/davem330/sparc · 5fe858b5
      Linus Torvalds 提交于
      * git://github.com/davem330/sparc:
        sparc64: Force the execute bit in OpenFirmware's translation entries.
        sparc: Make '-p' boot option meaningful again.
        sparc, exec: remove redundant addr_limit assignment
        sparc64: Future proof Niagara cpu detection.
      5fe858b5
    • L
      Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~keithp/linux · 8e8e500f
      Linus Torvalds 提交于
      * 'drm-intel-fixes' of git://people.freedesktop.org/~keithp/linux:
        drm/i915: FBC off for ironlake and older, otherwise on by default
        drm/i915: Enable SDVO hotplug interrupts for HDMI and DVI
        drm/i915: Enable dither whenever display bpc < frame buffer bpc
      8e8e500f
    • B
      powerpc: Fix device-tree matching for Apple U4 bridge · 16fa42af
      Benjamin Herrenschmidt 提交于
      Apple Quad G5 has some oddity in it's device-tree which causes the new
      generic matching code to fail to relate nodes for PCI-E devices below U4
      with their respective struct pci_dev.  This breaks graphics on those
      machines among others.
      
      This fixes it using a quirk which copies the node pointer from the host
      bridge for the root complex, which makes the generic code work for the
      children afterward.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16fa42af
    • W
      bootup: move 'usermodehelper_enable()' a little earlier · b0f84374
      wangyanqing 提交于
      Commit d5767c53 ("bootup: move 'usermodehelper_enable()' to the end
      of do_basic_setup()") moved 'usermodehelper_enable()' to end of
      do_basic_setup() to after the initcalls.  But then I get failed to let
      uvesafb work on my computer, and lose the splash boot.
      
      So maybe we could start usermodehelper_enable a little early to make
      some task work that need eary init with the help of user mode.
      
      [ I would *really* prefer that initcalls not call into user space - even
        the real 'init' hasn't been execve'd yet, after all! But for uvesafb
        it really does look like we don't have much choice.
      
        I considered doing this when we mount the root filesystem, but
        depending on config options that is in multiple places.  We could do
        the usermode helper enable as a rootfs_initcall()..
      
        So I'm just using wang yanqing's trivial patch.  It's not wonderful,
        but it's simple and should work.  We should revisit this some day,
        though.      - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0f84374
    • J
      perf tools: Fix raw sample reading · 8e303f20
      Jiri Olsa 提交于
      Wrong pointer is being passed for raw data sanity checking, when parsing
      sample event.
      
      This ends up with invalid event and perf record being stuck in
      __perf_session__process_events function during processing build IDs
      (process_buildids function).
      
      Following command hangs up in my setup:
      	./perf record -e raw_syscalls:sys_enter ls
      
      The fix is to use proper pointer to the raw data instead of the 'u'
      union.
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1317308709-9474-2-git-send-email-jolsa@redhat.comSigned-off-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      8e303f20
    • D
      sparc64: Force the execute bit in OpenFirmware's translation entries. · f4142cba
      David S. Miller 提交于
      In the OF 'translations' property, the template TTEs in the mappings
      never specify the executable bit.  This is the case even though some
      of these mappings are for OF's code segment.
      
      Therefore, we need to force the execute bit on in every mapping.
      
      This problem can only really trigger on Niagara/sun4v machines and the
      history behind this is a little complicated.
      
      Previous to sun4v, the sun4u TTE entries lacked a hardware execute
      permission bit.  So OF didn't have to ever worry about setting
      anything to handle executable pages.  Any valid TTE loaded into the
      I-TLB would be respected by the chip.
      
      But sun4v Niagara chips have a real hardware enforced executable bit
      in their TTEs.  So it has to be set or else the I-TLB throws an
      instruction access exception with type code 6 (protection violation).
      
      We've been extremely fortunate to not get bitten by this in the past.
      
      The best I can tell is that the OF's mappings for it's executable code
      were mapped using permanent locked mappings on sun4v in the past.
      Therefore, the fact that we didn't have the exec bit set in the OF
      translations we would use did not matter in practice.
      
      Thanks to Greg Onufer for helping me track this down.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4142cba
  5. 29 9月, 2011 3 次提交
    • L
      bootup: move 'usermodehelper_enable()' to the end of do_basic_setup() · d5767c53
      Linus Torvalds 提交于
      Doing it just before starting to call into cpu_idle() made a sick kind
      of sense only because the original bug we fixed (see commit
      288d5abe: "Boot up with usermodehelper disabled") was about problems
      with some scheduler data structures not being initialized, and they had
      better be initialized at that point.
      
      But it really didn't make any other conceptual sense, and doing it after
      the initial "schedule()" call for the idle thread actually opened up a
      race: what if the main initialization thread did everything without
      needing to sleep, and got all the way into user land too? Without
      actually having scheduled back to the idle thread?
      
      Now, in normal circumstances that doesn't ever happen, but it looks like
      Richard Cochran triggered exactly that on his ARM IXP4xx machines:
      
        "I have some ARM IXP4xx based machines that use the two on chip MAC
         ports (aka NPEs).  The NPE needs a firmware in order to function.
         Ever since the following commit [that 288d5abe one], it is no
         longer possible to bring up the interfaces during the init scripts."
      
      with a call trace showing an ioctl coming from user space. Richard says:
      
        "The init is busybox, and the startup script does mount, syslogd, and
         then ifup, so that all can go by quickly."
      
      The fix is to move the usermodehelper_enable() into the main 'init'
      thread, and just put it after we've done all our initcalls.  By then,
      everything really should be up, but we've obviously not actually started
      the user-mode portion of init yet.
      Reported-and-tested-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5767c53
    • S
      libceph: fix pg_temp mapping update · 8adc8b3d
      Sage Weil 提交于
      The incremental map updates have a record for each pg_temp mapping that is
      to be add/updated (len > 0) or removed (len == 0).  The old code was
      written as if the updates were a complete enumeration; that was just wrong.
      Update the code to remove 0-length entries and drop the rbtree traversal.
      
      This avoids misdirected (and hung) requests that manifest as server
      errors like
      
      [WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11
      Signed-off-by: NSage Weil <sage@newdream.net>
      8adc8b3d
    • S
      libceph: fix pg_temp mapping calculation · 782e182e
      Sage Weil 提交于
      We need to apply the modulo pg_num calculation before looking up a pgid in
      the pg_temp mapping rbtree.  This fixes pg_temp mappings, and fixes
      (some) misdirected requests that result in messages like
      
      [WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11
      
      on the server and stall make the client block without getting a reply (at
      least until the pg_temp mapping goes way, but that can take a long long
      time).
      
      Reorder calc_pg_raw() a bit to make more sense.
      Signed-off-by: NSage Weil <sage@newdream.net>
      782e182e
  6. 28 9月, 2011 15 次提交