1. 31 Mar, 2015 2 commits
    • Universal Compactions with Small Files · b23bbaa8
      sdong committed
      Summary:
      With this change, we use L1 and up to store compaction outputs in universal compaction.
      The compaction pick logic stays the same. Outputs are stored in the largest "level" possible.
      
      If options.num_levels=1, behavior is the same as before.
      
      Test Plan:
      1) convert most of the existing unit tests for universal compaction to cover both one level and multiple levels.
      2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels
      3) add a unit test to migrate from a multi-level setting back to a one-level setting
      4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results.
      
      Reviewers: rven, kradhakrishnan, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: meyering, leveldb, MarkCallaghan, dhruba
      
      Differential Revision: https://reviews.facebook.net/D34539
    • Clean up old log files in background threads · fd3dbef2
      Igor Canadi committed
      Summary:
      Cleaning up log files can do heavy IO, since we call ftruncate() in the destructor. We don't want to call ftruncate() in user threads.
      
      This diff moves the cleanup to background threads (flush and compaction).
      
      Test Plan: make check, will also run valgrind
      
      Reviewers: yhchiang, rven, MarkCallaghan, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D36177
  2. 28 Mar, 2015 1 commit
  3. 27 Mar, 2015 1 commit
    • Dump compression info on startup · 030859eb
      Igor Canadi committed
      Summary: It's useful to know whether we have compression support or not.
      
      Test Plan:
      Observed this in my LOG:
      
            2015/03/26-10:34:35.460681 7f5b322b7840 Snappy supported
            2015/03/26-10:34:35.460682 7f5b322b7840 Zlib supported
            2015/03/26-10:34:35.460686 7f5b322b7840 Bzip supported
            2015/03/26-10:34:35.460687 7f5b322b7840 LZ4 NOT supported
      
      Reviewers: sdong, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D35955
  4. 24 Mar, 2015 1 commit
    • Improve ThreadStatusSingleCompaction · a057bb2a
      Yueh-Hsuan Chiang committed
      Summary:
      Improve ThreadStatusSingleCompaction in two ways:
      1. Use SYNC_POINT to ensure compaction won't happen
         before the test finishes its "Put Phase" instead of
         using sleep.
      2. The Put Phase continues until we have a sufficient
         number of L0 files.  Note that during the Put Phase,
         no compaction will consume L0 files
         because of item 1.
      
      Test Plan: ./db_test  --gtest_filter="*ThreadStatusSingleCompaction*"
      
      Reviewers: sdong, igor, rven
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D35727
  5. 20 Mar, 2015 1 commit
    • Don't delete files when column family is dropped · b088c83e
      Igor Canadi committed
      Summary:
      To understand the bug read t5943287 and check out the new test in column_family_test (ReadDroppedColumnFamily), iter 0.
      
      The RocksDB contract allows you to read a dropped column family as long as there is a live reference. However, since our iteration ignored dropped column families, AddLiveFiles() didn't mark the files of dropped column families as live, so we deleted them.
      
      In this patch I no longer ignore dropped column families in the iteration. I think this behavior was confusing, and it also led to this bug. Now, if an iterator client wants to ignore dropped column families, it needs to do so explicitly.
      
      Test Plan: Added a new unit test that is failing on master. Unit test succeeds now.
      
      Reviewers: sdong, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D32535
  6. 18 Mar, 2015 1 commit
  7. 17 Mar, 2015 2 commits
    • Remove unused parameter in CancelAllBackgroundWork · 98c37fda
      Venkatesh Radhakrishnan committed
      Summary: Some suggestions for cleanup from Igor.
      
      Test Plan: Regression tests.
      
      Reviewers: igor
      
      Reviewed By: igor
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D35169
    • Speed up RocksDB close call. · b2b30865
      Venkatesh Radhakrishnan committed
      Summary:
      In RocksDB, when multiple instances are doing
      flushes/compactions in the background, the close call takes a long time
      because the flushes/compactions need to complete before the database can
      shut down. If another instance is using the background threads and this instance's compaction has been scheduled but is still waiting in the queue, we still cannot shut down. We now remove scheduled background tasks that have not yet started running, which speeds up shutdown.
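
The idea of withdrawing queued-but-not-started work can be illustrated with a minimal sketch. `PendingJobQueue`, `Schedule`, and `Unschedule` are hypothetical names chosen for this example and are not taken from the diff; the sketch is single-threaded and omits the locking a real shared thread pool would need.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical queue of not-yet-started background jobs shared by several
// DB instances. Each job is tagged with its owner so that a closing
// instance can withdraw only its own pending work.
// Not thread-safe: a real pool would guard jobs_ with a mutex.
class PendingJobQueue {
 public:
  void Schedule(const void* owner, std::function<void()> job) {
    jobs_.emplace_back(owner, std::move(job));
  }

  // Drops every queued job that belongs to `owner` and returns how many
  // were removed. Jobs that are already running are not in the queue and
  // are therefore unaffected.
  std::size_t Unschedule(const void* owner) {
    std::size_t removed = 0;
    for (auto it = jobs_.begin(); it != jobs_.end();) {
      if (it->first == owner) {
        it = jobs_.erase(it);
        ++removed;
      } else {
        ++it;
      }
    }
    return removed;
  }

  std::size_t size() const { return jobs_.size(); }

 private:
  std::deque<std::pair<const void*, std::function<void()>>> jobs_;
};
```

With such a queue, a closing instance would call `Unschedule(this)` and then wait only for the jobs it already has running.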
      
      Test Plan: DB Test added.
      
      Reviewers: yhchiang, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D33741
  8. 14 Mar, 2015 1 commit
    • EventLogger · 52d8347a
      Igor Canadi committed
      Summary:
      Here's my proposal for making our LOGs easier to read by machines.
      
      The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize.
      
      I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush, compaction, etc.).
      
      Test Plan:
      Ran db_bench. Observed:
      2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699}
      2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12}
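
A consumer of these lines first has to separate the JSON object from the timestamp and thread-id prefix. The sketch below shows one way to do that with the standard library only. `ExtractEventJson` is a hypothetical helper name, and the `EVENT_LOG_v1 ` marker is taken from the sample output above; the extracted payload would then be handed to whatever JSON parser the consumer uses.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the JSON payload of an EVENT_LOG_v1 line
// into *json and returns true, or returns false if the line carries no
// event payload. The marker string is taken from the sample LOG output.
bool ExtractEventJson(const std::string& log_line, std::string* json) {
  static const std::string kMarker = "EVENT_LOG_v1 ";
  const std::size_t pos = log_line.find(kMarker);
  if (pos == std::string::npos) {
    return false;
  }
  *json = log_line.substr(pos + kMarker.size());
  return true;
}
```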
      
      Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D31647
  9. 12 Mar, 2015 2 commits
  10. 07 Mar, 2015 1 commit
  11. 04 Mar, 2015 1 commit
    • Fix a bug in stall time counter. Improve its output format. · 694988b6
      Yueh-Hsuan Chiang committed
      Summary: Fix a bug in stall time counter.  Improve its output format.
      
      Test Plan:
      export ROCKSDB_TESTS=Timeout
      ./db_test
      
      ./db_bench --benchmarks=fillrandom --stats_interval=10000 --statistics=true --stats_per_interval=1 --num=1000000 --threads=4 --level0_stop_writes_trigger=3 --level0_slowdown_writes_trigger=2
      
      sample output:
          Uptime(secs): 35.8 total, 0.0 interval
          Cumulative writes: 359590 writes, 359589 keys, 183047 batches, 2.0 writes per batch, 0.04 GB user ingest, stall seconds: 1786.008 ms
          Cumulative WAL: 359591 writes, 183046 syncs, 1.96 writes per sync, 0.04 GB written
          Interval writes: 253 writes, 253 keys, 128 batches, 2.0 writes per batch, 0.0 MB user ingest, stall time: 0 us
          Interval WAL: 253 writes, 128 syncs, 1.96 writes per sync, 0.00 MB written
      
      Reviewers: MarkCallaghan, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D34275
  12. 03 Mar, 2015 2 commits
    • options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size... · db037393
      Igor Canadi committed
      options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
      
      Summary:
      With a fixed max_bytes_for_level_base, the ratio between the size of the largest level and that of the second-largest one can range from 0 to the multiplier. This makes the LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
      
      In this improvement (proposed by Igor Kabiljo), we introduce the option options.level_compaction_dynamic_level_bytes. When it is turned on, RocksDB is free to pick a level base in the range (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that the real level ratios are close to options.max_bytes_for_level_multiplier.
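
One way to land in that half-open range is to start from the size of the largest level and divide by the multiplier until the value no longer exceeds max_bytes_for_level_base. The sketch below illustrates that arithmetic only; `PickDynamicLevelBase` is a hypothetical name, and the code is an assumption about how such a base could be derived, not the implementation in this diff.

```cpp
#include <algorithm>

// Hypothetical helper (not the RocksDB implementation): derives a base
// level size from the size of the largest level by repeatedly dividing by
// the multiplier until the value no longer exceeds max_base.
// If last_level_bytes > max_base / multiplier, the result lies in
// (max_base / multiplier, max_base]. A smaller last level is returned
// unchanged, since there is nothing to divide down.
double PickDynamicLevelBase(double last_level_bytes, double max_base,
                            double multiplier) {
  if (multiplier <= 1.0) {
    // Dividing would never shrink the value; just cap it.
    return std::min(last_level_bytes, max_base);
  }
  double base = last_level_bytes;
  while (base > max_base) {
    base /= multiplier;
  }
  return base;
}
```

For example, with max_base = 1000 and multiplier = 10, a largest level of 27600 bytes gives 27600 → 2760 → 276, so every level above the base is exactly 10 times the previous one.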
      
      Test Plan: New unit tests and pass tests suites including valgrind.
      
      Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
      
      Reviewed By: ikabiljo
      
      Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D31437
    • Fix typo in log message · c4bd03a9
      Mark Callaghan committed
      Summary:
      fix typo
      
      Reviewers: igor
      
      Reviewed By: igor
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D34251
  13. 27 Feb, 2015 1 commit
    • rocksdb: Add missing override · 62247ffa
      Igor Sugak committed
      Summary:
      When using the latest clang (3.6 or 3.7/trunk), rocksdb fails to build with many errors. Almost all of them are missing-override errors. This diff adds the missing override keywords. No manual changes.
      
      Prerequisites: bear and clang 3.5 built with extra tools
      
      ```lang=bash
      % USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html
      % clang-modernize -p . -include . -add-override
      % make format
      ```
      
      Test Plan:
      Make sure all tests are passing.
      ```lang=bash
      % #Use default fb code clang.
      % make check
      ```
      Verify that there are fewer errors and no missing-override errors.
      ```lang=bash
      % # Have trunk clang present in path.
      % ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make
      ```
      
      Reviewers: igor, kradhakrishnan, rven, meyering, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D34077
  14. 24 Feb, 2015 1 commit
  15. 20 Feb, 2015 1 commit
    • build: do not relink every single binary just for a timestamp · a42324e3
      Jim Meyering committed
      Summary:
      Prior to this change, "make check" would always waste a lot of
      time relinking 60+ binaries. With this change, it does that
      only when the generated file, util/build_version.cc, changes,
      and that happens only when the date changes or when the
      current git SHA changes.
      
      This change makes some other improvements: before, there was no
      rule to build a deleted util/build_version.cc. If it was somehow
      removed, any attempt to link a program would fail.
      There is no longer any need for the separate file,
      build_tools/build_detect_version.  Its functionality is
      now in the Makefile.
      
      * Makefile (DEPFILES): Don't filter-out util/build_version.cc.
      No need, and besides, removing that dependency was wrong.
      (date, git_sha, gen_build_version): New helper variables.
      (util/build_version.cc): New rule, to create this file
      and update it only if it would contain new information.
      * build_tools/build_detect_version: Remove file.
      * db/db_impl.cc: Now, print only date (not the time).
      * util/build_version.h (rocksdb_build_compile_time): Remove
      declaration.  No longer used.
      
      Test Plan:
      - Run "make check" twice, and note that the second time no linking is performed.
      - Remove util/build_version.cc and ensure that any "make"
      command regenerates it before doing anything else.
      - Run this: strings librocksdb.a|grep _build_.
      That prints output including the following:
      
        rocksdb_build_git_date:2015-02-19
        rocksdb_build_git_sha:2.8.fb-1792-g3cb6cc0
      
      Reviewers: ljin, sdong, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D33591
  16. 19 Feb, 2015 1 commit
    • Managed iterator · 7d817268
      Venkatesh Radhakrishnan committed
      Summary:
      This is a diff for managed iterator. A managed iterator
      is a wrapper around an iterator which saves the options for that
      iterator as well as the current key/value so that the underlying iterator
      and its associated memory can be released when it is aged out
      automatically or on the request of the user. Will provide the automatic release as a follow-up diff.
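
The save-and-restore idea can be sketched over a plain `std::map` standing in for the database. `ManagedCursor` and its methods are hypothetical names for this illustration and do not reflect the API added by the diff. The sketch also assumes the table is not modified while the cursor is in use, so re-seeking to the saved key always finds the same entry.

```cpp
#include <map>
#include <memory>
#include <string>

using Table = std::map<std::string, std::string>;

// Hypothetical sketch, not RocksDB's managed iterator: remembers the
// current key/value so that the underlying iterator can be released and
// transparently re-created on the next move.
class ManagedCursor {
 public:
  explicit ManagedCursor(const Table* table) : table_(table) {}

  void Seek(const std::string& key) {
    it_.reset(new Table::const_iterator(table_->lower_bound(key)));
    Save();
  }

  void Next() {
    if (!valid_) return;  // Nothing to advance from.
    Restore();
    ++(*it_);
    Save();
  }

  bool Valid() const { return valid_; }
  const std::string& key() const { return key_; }
  const std::string& value() const { return value_; }

  // Releases the underlying iterator; the saved key/value stay readable.
  void Release() { it_.reset(); }

 private:
  // Copies the current position out of the underlying iterator.
  void Save() {
    valid_ = (*it_ != table_->end());
    if (valid_) {
      key_ = (*it_)->first;
      value_ = (*it_)->second;
    }
  }

  // Re-creates the underlying iterator at the saved key if it was released.
  void Restore() {
    if (!it_) {
      it_.reset(new Table::const_iterator(table_->lower_bound(key_)));
    }
  }

  const Table* table_;
  std::unique_ptr<Table::const_iterator> it_;
  bool valid_ = false;
  std::string key_;
  std::string value_;
};
```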
      
      Test Plan: Managed* tests in db_test and XF tests for managed iterator
      
      Reviewers: igor, yhchiang, anthony, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D31401
  17. 18 Feb, 2015 1 commit
  18. 13 Feb, 2015 1 commit
    • Introduce job_id for flush and compaction · e7ea51a8
      Igor Canadi committed
      Summary:
      It would be good to assign background jobs their own IDs. Two benefits:
      1) makes LOGs more readable
      2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize
      
      Test Plan: ran rocksdb, read the LOG
      
      Reviewers: sdong, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D31617
  19. 10 Feb, 2015 2 commits
    • Fix deleting obsolete files #2 · 863009b5
      Igor Canadi committed
      Summary: For a description of the bug, see the comment in db_test. The fix is pretty straightforward.
      
      Test Plan: Added a unit test. Eventually we need better testing of the FOF/POF process.
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D33081
    • Print DB pointer when opening a DB · 91ac3b20
      sdong committed
      Summary: Having the DB pointer in the log is helpful when debugging with GDB or working on a dump. If the client process doesn't have any thread actively working on RocksDB, the pointer can be hard to find.
      
      Test Plan: make all check
      
      Reviewers: rven, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: yoshinorim, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D33159
  20. 07 Feb, 2015 1 commit
    • Fix deleting obsolete files · 2a979822
      Igor Canadi committed
      Summary:
      This diff basically reverts D30249 and also adds a unit test that was failing before this patch.
      
      I have no idea how I didn't catch this terrible bug when writing the diff, sorry about that :(
      
      I think we should redesign our system of keeping track of and deleting files. This is already the second bug in this critical piece of code. I'll think of a few ideas.
      
      BTW this diff is also a regression when running lots of column families. I plan to revisit this separately.
      
      Test Plan: added a unit test
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D33045
  21. 06 Feb, 2015 1 commit
  22. 05 Feb, 2015 1 commit
    • Add a counter for collecting the wait time on db mutex. · 181191a1
      Yueh-Hsuan Chiang committed
      Summary:
      Add a counter for collecting the wait time on db mutex.
      Also add MutexWrapper and CondVarWrapper for measuring wait time.
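
The measurement itself amounts to timing the blocking lock call and adding the elapsed time to a counter. The following is a minimal sketch of that idea using the standard library. `TimedMutex` is a hypothetical class, not the MutexWrapper from this diff, and the counter it updates stands in for the statistics ticker reported below.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical wrapper, not RocksDB's MutexWrapper: accumulates the time
// callers spend blocked in Lock() into a caller-supplied counter, in
// microseconds.
class TimedMutex {
 public:
  explicit TimedMutex(std::atomic<uint64_t>* wait_micros)
      : wait_micros_(wait_micros) {}

  void Lock() {
    const auto start = std::chrono::steady_clock::now();
    mu_.lock();  // May block; the elapsed time is the wait time.
    const auto waited = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start);
    wait_micros_->fetch_add(static_cast<uint64_t>(waited.count()));
  }

  void Unlock() { mu_.unlock(); }

 private:
  std::mutex mu_;
  std::atomic<uint64_t>* wait_micros_;
};
```

Reading the clock twice per acquisition has a cost of its own, which is one reason to keep such instrumentation behind a statistics switch.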
      
      Test Plan:
      ./db_test
      export ROCKSDB_TESTS=MutexWaitStats
      ./db_test
      
      verify stats output using db_bench
      make clean
      make release
      ./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10
      
      Sample output:
          rocksdb.db.mutex.wait.micros COUNT : 7546866
      
      Reviewers: MarkCallaghan, rven, sdong, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D32787
  23. 28 Jan, 2015 1 commit
  24. 27 Jan, 2015 2 commits
    • Rename DBImpl::log_dir_unsynced_ to log_dir_synced_ · be8f0b12
      sdong committed
      Summary: log_dir_unsynced_ is a confusing name. Rename it to log_dir_synced_ and flip the value.
      
      Test Plan: Run ./fault_injection_test
      
      Reviewers: rven, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D32235
    • Sync WAL Directory and DB Path if different from DB directory · d888c957
      sdong committed
      Summary:
      1. If the WAL directory is different from the DB directory, sync the WAL directory after creating a log file under it.
      2. After creating an SST file, sync its parent directory instead of the DB directory.
      3. Change the check of kResetDeleteUnsyncedFiles in fault_injection_test. Since we changed the behavior to sync the log files' parent directory after the first WAL sync instead of at creation, kResetDeleteUnsyncedFiles is no longer guaranteed to show post-sync updates.
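
Syncing a directory makes a newly created entry in it durable, which is why the parent directory of the new file is the one that matters. A minimal POSIX sketch of that step follows. `SyncDirectory` is a hypothetical helper name, and RocksDB itself goes through its Env abstraction rather than calling these functions directly.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <string>

// Hypothetical POSIX-only helper: fsyncs the directory at dir_path so that
// a newly created entry in it (for example a fresh WAL or SST file)
// survives a crash. Returns false if the directory cannot be opened or
// the fsync fails.
bool SyncDirectory(const std::string& dir_path) {
  const int fd = open(dir_path.c_str(), O_RDONLY);
  if (fd < 0) {
    return false;
  }
  const bool ok = (fsync(fd) == 0);
  close(fd);
  return ok;
}
```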
      
      Test Plan: make all check
      
      Reviewers: yhchiang, rven, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D32067
  25. 24 Jan, 2015 1 commit
    • Fix data race #2 · 42189612
      Igor Canadi committed
      Summary: We should not be calling InternalStats methods outside of the mutex.
      
      Test Plan:
      COMPILE_WITH_TSAN=1 m db_test && ROCKSDB_TESTS=CompactionTrigger ./db_test
      
      failing before the diff, works now
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D32127
  26. 23 Jan, 2015 1 commit
    • Sync manifest file when initializing it · 4e48753b
      sdong committed
      Summary: Currently we don't sync the manifest file when initializing it, so the DB cannot be safely reopened before the first memtable flush. Fix this by syncing it. This fixes fault_injection_test.
      
      Test Plan: make all check
      
      Reviewers: rven, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D32001
  27. 22 Jan, 2015 1 commit
  28. 16 Jan, 2015 1 commit
    • Remove Compaction::ReleaseInputs(). · b229f970
      Yueh-Hsuan Chiang committed
      Summary:
      This patch removes the unnecessary Compaction::ReleaseInputs().
      
      Compaction::ReleaseInputs() tries to unref its input_version
      and column_family.  However, such unref is always done in
      ~Compaction(), and all current ReleaseInputs() calls are
      right before the destructor.
      
      Test Plan: ./db_test
      
      Reviewers: igor
      
      Reviewed By: igor
      
      Subscribers: igor, rven, dhruba, sdong
      
      Differential Revision: https://reviews.facebook.net/D31605
  29. 13 Jan, 2015 1 commit
  30. 10 Jan, 2015 1 commit
    • DB Stats Dump to print total stall time · 9132e52e
      sdong committed
      Summary:
      Add printing of stall time in DB Stats:
      
      Sample outputs:
      
      ** DB Stats **
      Uptime(secs): 53.2 total, 1.7 interval
      Cumulative writes: 625940 writes, 625939 keys, 625940 batches, 1.0 writes per batch, 0.49 GB user ingest, stall micros: 50691070
      Cumulative WAL: 625940 writes, 625939 syncs, 1.00 writes per sync, 0.49 GB written
      Interval writes: 10859 writes, 10859 keys, 10859 batches, 1.0 writes per batch, 8.7 MB user ingest, stall micros: 1692319
      Interval WAL: 10859 writes, 10859 syncs, 1.00 writes per sync, 0.01 MB written
      
      Test Plan:
      make all check
      verify printing using db_bench
      
      Reviewers: igor, yhchiang, rven, MarkCallaghan
      
      Reviewed By: MarkCallaghan
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D31239
  31. 07 Jan, 2015 1 commit
    • Simplify column family concurrency · 7731d51c
      Igor Canadi committed
      Summary:
      This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_.
      
      Before:
      * When mutating: lock DB mutex and spin lock
      * When reading: lock DB mutex OR spin lock
      
      After:
      * When mutating: lock DB mutex and be in write thread
      * When reading: lock DB mutex or be in write thread
      
      That way, we eliminate the spin lock that protects these hash maps and simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps).
      
      With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread).
      
      Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free.
      
      This addresses task t5116919.
      
      Test Plan:
      make check
      
      Stress test with lots and lots of column family drop and create:
      
         time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30651
  32. 06 Jan, 2015 3 commits
    • Fix compaction summary log for trivial move · 07aa4e0e
      Igor Canadi committed
      Summary: When trivial move commit is done, we log the summary of the input version instead of current. This is inconsistent with other log messages and confusing.
      
      Test Plan: compiles
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30939
    • Deprecating skip_log_error_on_recovery · 62ad0a9b
      Igor Canadi committed
      Summary:
      Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery.
      
      The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I dug up old task t2416297, which confirms my findings.
      
      Test Plan: There were actually no tests that verified the correct behavior of skip_log_error_on_recovery.
      
      Reviewers: yhchiang, rven, dhruba, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30603
    • I