1. 11 Nov, 2011 1 commit
    • J
      clocksource: Avoid selecting mult values that might overflow when adjusted · d65670a7
      John Stultz committed
      For some frequencies, the clocks_calc_mult_shift() function will
      unfortunately select mult values very close to 0xffffffff.  This
      has the potential to overflow when NTP adjusts the clock, adding
      to the mult value.
      
      This patch adds a clocksource.maxadj value, which provides
      an approximation of an 11% adjustment (NTP limits adjustments to
      500ppm and the tick adjustment is limited to 10%), which could
      be made to the clocksource.mult value. This is then used to both
      check that the current mult value won't overflow/underflow, as
      well as warning us if the timekeeping_adjust() code pushes over
      that 11% boundary.
      
      v2: Fix max_adjustment calculation, and improve WARN_ONCE
      messages.
      
      v3: Don't warn before maxadj has actually been set
      
      CC: Yong Zhang <yong.zhang0@gmail.com>
      CC: David Daney <ddaney.cavm@gmail.com>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Chen Jie <chenj@lemote.com>
      CC: zhangfx <zhangfx@lemote.com>
      CC: stable@kernel.org
      Reported-by: Chen Jie <chenj@lemote.com>
      Reported-by: zhangfx <zhangfx@lemote.com>
      Tested-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
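The overflow check described in this commit can be illustrated with a small userspace model. This is only a sketch: `struct cs_model`, `cs_max_adjustment()` and `cs_fit_mult()` are hypothetical names, and halving `mult` until the adjusted value fits is an assumption about how such a check might be acted on, not a copy of the patch.

```c
#include <stdint.h>

/* Hypothetical userspace model of a clocksource; only the fields used here. */
struct cs_model {
	uint32_t mult;
	uint32_t shift;
	uint32_t maxadj;
};

/* Approximate an 11% adjustment of mult, computed in 64 bits to avoid overflow. */
static uint32_t cs_max_adjustment(const struct cs_model *cs)
{
	return (uint32_t)(((uint64_t)cs->mult * 11) / 100);
}

/*
 * Assumed policy: halve mult (and decrement shift so mult/2^shift is kept)
 * until mult + maxadj and mult - maxadj no longer wrap a 32-bit value.
 */
static void cs_fit_mult(struct cs_model *cs)
{
	cs->maxadj = cs_max_adjustment(cs);
	while ((uint32_t)(cs->mult + cs->maxadj) < cs->mult ||
	       (uint32_t)(cs->mult - cs->maxadj) > cs->mult) {
		cs->mult >>= 1;
		cs->shift--;
		cs->maxadj = cs_max_adjustment(cs);
	}
}
```

With a mult close to 0xffffffff the loop runs once, so the value left in `mult` has room for an 11% increase.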
  2. 12 Oct, 2011 1 commit
  3. 05 Oct, 2011 2 commits
  4. 21 Sep, 2011 1 commit
  5. 14 Sep, 2011 1 commit
  6. 13 Sep, 2011 1 commit
  7. 08 Sep, 2011 9 commits
    • P
      posix-cpu-timers: Cure SMP accounting oddities · e8abccb7
      Peter Zijlstra committed
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Cc: stable@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
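The invariant David describes (the process clock difference must be at least the thread clock difference) can be checked with a pair of helpers. This is a sketch that is not part of the original report: `timespec_diff_ns()` and `process_covers_thread()` are hypothetical names. Unlike the reproduction program, the difference here also accounts for `tv_sec`, so it stays correct when a measurement crosses a second boundary.

```c
#include <stdint.h>
#include <time.h>

/* Elapsed nanoseconds between two timestamps, including whole seconds. */
static int64_t timespec_diff_ns(const struct timespec *before,
				const struct timespec *after)
{
	return (int64_t)(after->tv_sec - before->tv_sec) * 1000000000LL +
	       (after->tv_nsec - before->tv_nsec);
}

/* Returns 1 when the process CPU time covers at least the thread CPU time. */
static int process_covers_thread(int64_t process_diff_ns, int64_t thread_diff_ns)
{
	return process_diff_ns >= thread_diff_ns;
}
```

Applied to the sample output above, the process difference is 497402404 ns and the thread difference is 498234739 ns, so the check fails, which is the oddity being reported.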
    • M
      s390: Use direct ktime path for s390 clockevent device · 4f37a68c
      Martin Schwidefsky committed
      The clock comparator on s390 uses the same format as the TOD clock.
      If the value in the clock comparator is smaller than the current TOD
      value an interrupt is pending. Use the CLOCK_EVT_FEAT_KTIME feature
      to get the unmodified ktime of the next clockevent expiration and
      use it to program the clock comparator without querying the TOD clock.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133143.153017933@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • M
      clockevents: Add direct ktime programming function · 65516f8a
      Martin Schwidefsky committed
      There is at least one architecture (s390) with a sane clockevent device
      that can be programmed with the equivalent of a ktime. No need to create
      a delta against the current time, the ktime can be used directly.
      
      A new clock device function 'set_next_ktime' is introduced that is called
      with the unmodified ktime for the timer if the clock event device has the 
      CLOCK_EVT_FEAT_KTIME bit set.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133142.815350967@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
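The dispatch this commit describes might look roughly like the following userspace model. It is a sketch under assumptions: `struct evt_dev`, `EVT_FEAT_KTIME`, `program_event()` and the two demo callbacks are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

#define EVT_FEAT_KTIME 0x1u /* hypothetical stand-in for CLOCK_EVT_FEAT_KTIME */

/* Hypothetical model of a clock event device. */
struct evt_dev {
	unsigned int features;
	int (*set_next_event)(int64_t delta_ns, struct evt_dev *dev);
	int (*set_next_ktime)(int64_t expires_ns, struct evt_dev *dev);
	int64_t programmed; /* value last handed to the "hardware" */
};

/* Demo callbacks that just record what they were asked to program. */
static int demo_set_delta(int64_t delta_ns, struct evt_dev *dev)
{
	dev->programmed = delta_ns;
	return 0;
}

static int demo_set_ktime(int64_t expires_ns, struct evt_dev *dev)
{
	dev->programmed = expires_ns;
	return 0;
}

/* Pass the absolute expiry through unmodified when the device supports it. */
static int program_event(struct evt_dev *dev, int64_t expires_ns, int64_t now_ns)
{
	if (dev->features & EVT_FEAT_KTIME)
		return dev->set_next_ktime(expires_ns, dev);
	/* Otherwise fall back to a delta against the current time. */
	return dev->set_next_event(expires_ns - now_ns, dev);
}
```

A device with the feature bit set never needs the current time, which is the saving the commit message points out for s390.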
    • M
      clockevents: Make minimum delay adjustments configurable · d1748302
      Martin Schwidefsky committed
      The automatic increase of the min_delta_ns of a clockevents device
      should be done in the clockevents code as the minimum delay is an
      attribute of the clockevents device.
      
      In addition, not all architectures want the automatic adjustment. On a
      massively virtualized system it can happen that the programming of a
      clock event fails several times in a row because the virtual cpu has
      been rescheduled quickly enough. In that case the minimum delay will
      erroneously be increased with no way back. The new config symbol
      GENERIC_CLOCKEVENTS_MIN_ADJUST is used to enable the automatic
      adjustment. The config option is selected only for x86.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: john stultz <johnstul@us.ibm.com>
      Link: http://lkml.kernel.org/r/20110823133142.494157493@de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
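To illustrate why the adjustment is made optional, here is a small model of a failure handler. Everything in it is an assumption for illustration: the `auto_adjust` field stands in for GENERIC_CLOCKEVENTS_MIN_ADJUST being enabled, and the 50% growth step and the 10 ms cap are invented values, not taken from the patch.

```c
#include <stdint.h>

#define MIN_DELTA_LIMIT_NS 10000000ULL /* hypothetical cap: 10 ms */

/* Hypothetical model of the relevant device state. */
struct evt_min {
	uint64_t min_delta_ns;
	int auto_adjust; /* models the config option being enabled */
};

/*
 * Called after programming an event failed.
 * Returns 0 if the caller may retry, -1 if it should give up.
 */
static int on_program_failure(struct evt_min *dev)
{
	if (!dev->auto_adjust)
		return 0; /* retry with the minimum left untouched */
	if (dev->min_delta_ns >= MIN_DELTA_LIMIT_NS)
		return -1; /* already at the cap */
	dev->min_delta_ns += dev->min_delta_ns >> 1; /* assumed growth: +50% */
	if (dev->min_delta_ns > MIN_DELTA_LIMIT_NS)
		dev->min_delta_ns = MIN_DELTA_LIMIT_NS;
	return 0;
}
```

With `auto_adjust` cleared, repeated failures caused by a descheduled virtual CPU leave `min_delta_ns` unchanged, so there is nothing to undo later.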
    • H
      nohz: Remove "Switched to NOHz mode" debugging messages · 29c158e8
      Heiko Carstens committed
      When performing cpu hotplug tests the kernel printk log buffer gets flooded
      with pointless "Switched to NOHz mode..." messages. Especially when afterwards
      analyzing a dump this might have removed more interesting stuff out of the
      buffer.
      Assuming that switching to NOHz mode simply works just remove the printk.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Link: http://lkml.kernel.org/r/20110823112046.GB2540@osiris.boeblingen.de.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • M
      proc: Consider NO_HZ when printing idle and iowait times · a25cac51
      Michal Hocko committed
      show_stat handler of the /proc/stat file relies on kstat_cpu(cpu)
      statistics when printing information about idle and iowait times.
      This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because
      counters are updated periodically.
      With NO_HZ things got more tricky because we are not doing idle/iowait
      accounting while we are tickless so the value might get outdated.
      Users of /proc/stat will notice that by unchanged idle/iowait values
      which is then interpreted as 0% idle/iowait time. From the user space
      POV this is an unexpected behavior and a change of the interface.
      
      Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the
      total idle/iowait time since boot and it doesn't rely on sampling or any
      other periodic activity. Fall back to the previous behavior if NO_HZ is
      disabled or not configured.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
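The fallback described in the last paragraph of this commit can be sketched as a single selection function. The names and the "unavailable" marker are assumptions made for this example; the commit message does not say how the NO_HZ-disabled case is signalled.

```c
#include <stdint.h>

/* Hypothetical marker meaning "NO_HZ accounting is not available". */
#define IDLE_TIME_UNAVAILABLE UINT64_MAX

/*
 * nohz_idle_us:    value from the tickless accounting, or the marker above
 * sampled_idle_us: value from the periodically updated per-CPU statistics
 */
static uint64_t report_idle_us(uint64_t nohz_idle_us, uint64_t sampled_idle_us)
{
	if (nohz_idle_us == IDLE_TIME_UNAVAILABLE)
		return sampled_idle_us; /* previous behavior */
	return nohz_idle_us;            /* accurate while tickless */
}
```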
    • M
      nohz: Make idle/iowait counter update conditional · 09a1d34f
      Michal Hocko committed
      get_cpu_{idle,iowait}_time_us update idle/iowait counters
      unconditionally if the given CPU is in the idle loop.
      
      This doesn't work well outside of CPU governors which are singletons
      so nobody (except for IRQ) can race with them.
      
      We will need to use both functions from /proc/stat handler to properly
      handle nohz idle/iowait times.
      
      Make the update depend on a non NULL last_update_time argument.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/11f23179472635ce52e78921d47a20216b872f23.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
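A conditional update keyed on a non-NULL `last_update_time` argument could look like the model below. It is a sketch, assuming a simplified `struct tick_model` and a caller-supplied `now_us`; neither is the kernel's actual data structure or signature.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-CPU tick state. */
struct tick_model {
	uint64_t idle_entry_us; /* when the current idle period started */
	uint64_t idle_sleep_us; /* accumulated idle time */
	int idle_active;        /* non-zero while the CPU is idle */
};

static uint64_t get_idle_time_us(struct tick_model *ts, uint64_t now_us,
				 uint64_t *last_update_time)
{
	uint64_t idle = ts->idle_sleep_us;

	if (last_update_time) {
		/* Updating caller: fold the running period into the counter. */
		if (ts->idle_active) {
			ts->idle_sleep_us += now_us - ts->idle_entry_us;
			ts->idle_entry_us = now_us;
		}
		*last_update_time = now_us;
		return ts->idle_sleep_us;
	}

	/* Read-only caller: compute the same value without touching state. */
	if (ts->idle_active)
		idle += now_us - ts->idle_entry_us;
	return idle;
}
```

A reader such as a /proc handler would pass NULL and so cannot race with the updating callers over the stored counters.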
    • M
      nohz: Fix update_ts_time_stat idle accounting · 6beea0cd
      Michal Hocko committed
      update_ts_time_stat currently updates idle time even if we are in
      iowait loop at the moment. The only real users of the idle counter
      (via get_cpu_idle_time_us) are CPU governors and they expect to get
      cumulative time for both idle and iowait times.
      The value (idle_sleeptime) is also printed to userspace by print_cpu
      but it prints both idle and iowait times so the idle part is misleading.
      
      Let's clean this up and fix update_ts_time_stat to account both counters
      properly and update consumers of idle to consider iowait time as well.
      If we do this we might use get_cpu_{idle,iowait}_time_us from other
      contexts as well and we will get expected values.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/e9c909c221a8da402c4da07e4cd968c3218f8eb1.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
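The split accounting described here can be modelled in a few lines. This is a sketch with hypothetical names (`struct sleep_stats`, `account_sleep()`, `total_sleep_us()`); it assumes the number of tasks waiting on I/O is what decides which counter a sleep period belongs to.

```c
#include <stdint.h>

/* Hypothetical pair of counters kept per CPU. */
struct sleep_stats {
	uint64_t idle_us;
	uint64_t iowait_us;
};

/* Charge a sleep period to exactly one of the two counters. */
static void account_sleep(struct sleep_stats *st, uint64_t delta_us, int nr_iowait)
{
	if (nr_iowait > 0)
		st->iowait_us += delta_us;
	else
		st->idle_us += delta_us;
}

/* What a governor-style consumer would read: idle plus iowait. */
static uint64_t total_sleep_us(const struct sleep_stats *st)
{
	return st->idle_us + st->iowait_us;
}
```

Consumers that previously relied on the idle counter containing both kinds of time now have to add the two values themselves, as `total_sleep_us()` does.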
    • M
      cputime: Clean up cputime_to_usecs and usecs_to_cputime macros · ef0e0f5e
      Michal Hocko committed
      Get rid of semicolon so that those expressions can be used also
      somewhere else than just in an assignment.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Link: http://lkml.kernel.org/r/7565417ce30d7e6b1ddc169843af0777dbf66e75.1314172057.git.mhocko@suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
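The problem with a trailing semicolon in an expression macro is easy to show in isolation. The macros below are hypothetical stand-ins with an invented conversion factor, not the kernel's `cputime_to_usecs` and `usecs_to_cputime` definitions.

```c
/* Hypothetical conversion factor, for illustration only. */
#define USECS_PER_TICK 10000UL

/* With a trailing semicolon the macro only works as a complete statement: */
/* #define ticks_to_usecs_bad(t) ((t) * USECS_PER_TICK); */

/* Without it, the macros are ordinary expressions. */
#define ticks_to_usecs(t) ((t) * USECS_PER_TICK)
#define usecs_to_ticks(u) ((u) / USECS_PER_TICK)

static unsigned long round_trip_plus_one(unsigned long ticks)
{
	/* Nesting and arithmetic like this would not compile with the ';' variant. */
	return usecs_to_ticks(ticks_to_usecs(ticks)) + 1;
}
```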
  8. 11 Aug, 2011 9 commits
  9. 10 Aug, 2011 2 commits
  10. 08 Aug, 2011 7 commits
  11. 07 Aug, 2011 6 commits
    • A
      Fix POSIX ACL permission check · 206b1d09
      Ari Savolainen committed
      After commit 3567866b: "RCUify freeing acls, let check_acl() go ahead in
      RCU mode if acl is cached" posix_acl_permission is being called with an
      unsupported flag and the permission check fails. This patch fixes the issue.
      Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • L
      Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd · c2f340a6
      Linus Torvalds committed
      * 'for-linus' of git://git.open-osd.org/linux-open-osd:
        ore: Make ore its own module
        exofs: Rename raid engine from exofs/ios.c => ore
        exofs: ios: Move to a per inode components & device-table
        exofs: Move exofs specific osd operations out of ios.c
        exofs: Add offset/length to exofs_get_io_state
        exofs: Fix truncate for the raid-groups case
        exofs: Small cleanup of exofs_fill_super
        exofs: BUG: Avoid sbi realloc
        exofs: Remove pnfs-osd private definitions
        nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4
    • L
      vfs: optimize inode cache access patterns · 3ddcd056
      Linus Torvalds committed
      The inode structure layout is largely random, and some of the vfs paths
      really do care.  The path lookup in particular is already quite D$
      intensive, and profiles show that accessing the 'inode->i_op->xyz'
      fields is quite costly.
      
      We already optimized the dcache to not unnecessarily load the d_op
      structure for members that are often NULL using the DCACHE_OP_xyz bits
      in dentry->d_flags, and this does something very similar for the inode
      ops that are used during pathname lookup.
      
      It also re-orders the fields so that the fields accessed by 'stat' are
      together at the beginning of the inode structure, and roughly in the
      order accessed.
      
      The effect of this seems to be in the 1-2% range for an empty kernel
      "make -j" run (which is fairly kernel-intensive, mostly in filename
      lookup), so it's visible.  The numbers are fairly noisy, though, and
      likely depend a lot on exact microarchitecture.  So there's more tuning
      to be done.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
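The idea of caching facts about the ops table in the inode itself can be sketched as follows. This is an illustrative model only: `IOP_HAS_LOOKUP`, `struct ops_model`, `struct inode_model` and the helpers are hypothetical names, and setting the flag eagerly in one helper is an assumption made to keep the example short.

```c
#include <stddef.h>

#define IOP_HAS_LOOKUP 0x1u /* hypothetical flag name */

/* Hypothetical ops table with a single optional method. */
struct ops_model {
	int (*lookup)(int name);
};

/* Hypothetical inode: the flag word sits next to the other hot fields. */
struct inode_model {
	unsigned int opflags; /* cached facts about i_op */
	const struct ops_model *i_op;
};

static int demo_lookup(int name)
{
	return name;
}

/* Record once whether the method exists. */
static void inode_cache_opflags(struct inode_model *inode)
{
	if (inode->i_op && inode->i_op->lookup)
		inode->opflags |= IOP_HAS_LOOKUP;
}

/* Hot path: one load from the inode, no dereference of i_op. */
static int inode_can_lookup(const struct inode_model *inode)
{
	return (inode->opflags & IOP_HAS_LOOKUP) != 0;
}
```

The hot-path test touches only the inode's own cache line, which is the kind of saving the commit message attributes to avoiding 'inode->i_op->xyz' accesses.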
    • L
      vfs: renumber DCACHE_xyz flags, remove some stale ones · 830c0f0e
      Linus Torvalds committed
      Gcc tends to generate better code with small integers, including the
      DCACHE_xyz flag tests - so move the common ones to be first in the list.
      Also just remove the unused DCACHE_INOTIFY_PARENT_WATCHED and
      DCACHE_AUTOFS_PENDING values, their users no longer exist in the source
      tree.
      
      And add a "unlikely()" to the DCACHE_OP_COMPARE test, since we want the
      common case to be a nice straight-line fall-through.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
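The effect of wrapping a rare flag test in "unlikely()" can be sketched in userspace with the GCC/Clang builtin. The flag value and the function below are hypothetical, and the macro is defined locally for the example.

```c
/* Branch hint built on the GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

#define DF_OP_COMPARE 0x0002u /* hypothetical flag value */

/*
 * fast_equal: result of the generic name comparison
 * slow_equal: result of a filesystem-specific comparison
 */
static int names_match(unsigned int flags, int fast_equal, int slow_equal)
{
	if (unlikely(flags & DF_OP_COMPARE))
		return slow_equal; /* rare: custom compare hook */
	return fast_equal;         /* common: straight-line fall-through */
}
```

The hint changes only how the compiler lays out the branches, not the result, so both paths return the same values with or without it.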
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 7cd4767e
      Linus Torvalds committed
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        net: Compute protocol sequence numbers and fragment IDs using MD5.
        crypto: Move md5_transform to lib/md5.c
    • B
      ore: Make ore its own module · cf283ade
      Boaz Harrosh committed
      Export everything from ore that needs exporting. Change Kbuild and Kconfig
      to build ore.ko as an independent module. Import ore from exofs.
      Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>