1. 27 1月, 2015 2 次提交
    • S
      Rename DBImpl::log_dir_unsynced_ to log_dir_synced_ · be8f0b12
      sdong 提交于
      Summary: log_dir_unsynced_ is a confusing name. Rename it to log_dir_synced_ and flip the value.
      
      Test Plan: Run ./fault_injection_test
      
      Reviewers: rven, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D32235
      be8f0b12
    • S
      Sync WAL Directory and DB Path if different from DB directory · d888c957
      sdong 提交于
      Summary:
      1. If WAL directory is different from db directory. Sync the directory after creating a log file under it.
      2. After creating an SST file, sync its parent directory instead of DB directory.
      3. change the check of kResetDeleteUnsyncedFiles in fault_injection_test. Since we changed the behavior to sync log files' parent directory after first WAL sync, instead of creating, kResetDeleteUnsyncedFiles will not guarantee to show post sync updates.
      
      Test Plan: make all check
      
      Reviewers: yhchiang, rven, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D32067
      d888c957
  2. 13 1月, 2015 1 次提交
  3. 22 12月, 2014 1 次提交
    • I
      Speed up FindObsoleteFiles() · 0acc7388
      Igor Canadi 提交于
      Summary:
      There are two versions of FindObsoleteFiles():
      * full scan, which is executed every 6 hours (and it's terribly slow)
      * no full scan, which is executed every time a background process finishes and iterator is deleted
      
      This diff is optimizing the second case (no full scan). Here's what we do before the diff:
      * Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live.
      * Get the list of live files to avoid deleting files that are live.
      * Delete files that are in obsolete_files and not in live_files.
      
      After this diff:
      * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files.
      * Get the list of obsolete files (which exclude moved files).
      * No need to get the list of live files, since all files in obsolete_files need to be deleted.
      
      I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123
      
      This depends on D30123.
      
      P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff.
      
      Test Plan:
      One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version.
      
      make check
      
      Big number of compactions and flushes:
      
        ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      Reviewers: yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30249
      0acc7388
  4. 20 12月, 2014 1 次提交
    • I
      Rewritten system for scheduling background work · fdb6be4e
      Igor Canadi 提交于
      Summary:
      When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
      
      The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
      
      Here are the performance results:
      
      Command:
      
          ./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000  --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
      
      Before the patch:
      
           fillrandom   :      26.950 micros/op 37105 ops/sec;    4.1 MB/s
      
      After the patch:
      
            fillrandom   :      17.404 micros/op 57456 ops/sec;    6.4 MB/s
      
      Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
      
            fillrandom   :       7.590 micros/op 131758 ops/sec;   14.6 MB/s
      
      Test Plan:
      make check
      
      two stress tests:
      
      Big number of compactions and flushes:
      
          ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      max_background_flushes=0, to verify that this case also works correctly
      
          ./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0  --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
      
      Reviewers: ljin, rven, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30123
      fdb6be4e
  5. 12 12月, 2014 1 次提交
    • S
      Improve scalability of DB::GetSnapshot() · d7a48666
      sdong 提交于
      Summary: Now DB::GetSnapshot() doesn't scale to more column families, as it needs to go through all the column families to find whether snapshot is supported. This patch optimizes it.
      
      Test Plan:
      Add unit tests to cover negative cases.
      make all check
      
      Reviewers: yhchiang, rven, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D30093
      d7a48666
  6. 09 12月, 2014 1 次提交
  7. 06 12月, 2014 1 次提交
  8. 03 12月, 2014 1 次提交
    • J
      Enforce write buffer memory limit across column families · a14b7873
      Jonah Cohen 提交于
      Summary:
      Introduces a new class for managing write buffer memory across column
      families.  We supplement ColumnFamilyOptions::write_buffer_size with
      ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
      instance that enforces memory limits before flushing out to disk.
      
      Test Plan: Added SharedWriteBuffer unit test to db_test.cc
      
      Reviewers: sdong, rven, ljin, igor
      
      Reviewed By: igor
      
      Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D22581
      a14b7873
  9. 21 11月, 2014 2 次提交
    • V
      Moved checkpoint to utilities · 004f416b
      Venkatesh Radhakrishnan 提交于
      Summary:
      Moved checkpoint to utilities.
      Addressed comments by Igor, Siying, Dhruba
      
      Test Plan: db_test/SnapshotLink
      
      Reviewers: dhruba, igor, sdong
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D29079
      004f416b
    • Y
      Introduce GetThreadList API · d0c5f28a
      Yueh-Hsuan Chiang 提交于
      Summary:
      Add GetThreadList API, which allows developer to track the
      status of each process.  Currently, calling GetThreadList will
      only get the list of background threads in RocksDB with their
      thread-id and thread-type (priority) set.  Will add more support
      on this in the later diffs.
      
      ThreadStatus currently has the following properties:
      
        // An unique ID for the thread.
        const uint64_t thread_id;
      
        // The type of the thread, it could be ROCKSDB_HIGH_PRIORITY,
        // ROCKSDB_LOW_PRIORITY, and USER_THREAD
        const ThreadType thread_type;
      
        // The name of the DB instance where the thread is currently
        // involved with.  It would be set to empty string if the thread
        // does not involve in any DB operation.
        const std::string db_name;
      
        // The name of the column family where the thread is currently
        // It would be set to empty string if the thread does not involve
        // in any column family.
        const std::string cf_name;
      
        // The event that the current thread is involved.
        // It would be set to empty string if the information about event
        // is not currently available.
      
      Test Plan:
      ./thread_list_test
      export ROCKSDB_TESTS=GetThreadList
      ./db_test
      
      Reviewers: rven, igor, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D25047
      d0c5f28a
  10. 19 11月, 2014 1 次提交
  11. 15 11月, 2014 2 次提交
    • I
      Explicitly clean JobContext · 5c04acda
      Igor Canadi 提交于
      Summary: This way we can gurantee that old MemTables get destructed before DBImpl gets destructed, which might be useful if we want to make them depend on state from DBImpl.
      
      Test Plan: make check with asserts in JobContext's destructor
      
      Reviewers: ljin, sdong, yhchiang, rven, jonahcohen
      
      Reviewed By: jonahcohen
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D28959
      5c04acda
    • V
      Provide openable snapshots · 6c1b040c
      Venkatesh Radhakrishnan 提交于
      Summary: Store links to live files in directory on same disk
      
      Test Plan:
      Take snapshot and open it. Added a test GetSnapshotLink in
      db_test.
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D28713
      6c1b040c
  12. 14 11月, 2014 2 次提交
  13. 08 11月, 2014 2 次提交
    • Y
      CompactFiles, EventListener and GetDatabaseMetaData · 28c82ff1
      Yueh-Hsuan Chiang 提交于
      Summary:
      This diff adds three sets of APIs to RocksDB.
      
      = GetColumnFamilyMetaData =
      * This APIs allow users to obtain the current state of a RocksDB instance on one column family.
      * See GetColumnFamilyMetaData in include/rocksdb/db.h
      
      = EventListener =
      * A virtual class that allows users to implement a set of
        call-back functions which will be called when specific
        events of a RocksDB instance happens.
      * To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
      
      = CompactFiles =
      * CompactFiles API inputs a set of file numbers and an output level, and RocksDB
        will try to compact those files into the specified level.
      
      = Example =
      * Example code can be found in example/compact_files_example.cc, which implements
        a simple external compactor using EventListener, GetColumnFamilyMetaData, and
        CompactFiles API.
      
      Test Plan:
      listener_test
      compactor_test
      example/compact_files_example
      export ROCKSDB_TESTS=CompactFiles
      db_test
      export ROCKSDB_TESTS=MetaData
      db_test
      
      Reviewers: ljin, igor, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: MarkCallaghan, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D24705
      28c82ff1
    • I
      Redesign pending_outputs_ · 53af5d87
      Igor Canadi 提交于
      Summary:
      Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it.
      
      Still have to write some comments, but should be good enough to start the discussion.
      
      Test Plan: make check, will also run stress test
      
      Reviewers: ljin, sdong, rven, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D28353
      53af5d87
  14. 05 11月, 2014 1 次提交
  15. 01 11月, 2014 1 次提交
    • I
      CompactionJob · 74eb4fbe
      Igor Canadi 提交于
      Summary:
      Long awaited CompactionJob class! Move most compaction-related things from DBImpl to CompactionJob, making CompactionJob easier to test and understand.
      
      Currently this is just replicating exactly the same functionality with as little as change as possible. As future work, we should:
      1. Add CompactionJob tests (I think I'll do that tomorrow)
      2. Reduce CompactionJob's state that it inherits from DBImpl
      3. Figure out how to do yielding to flush better. Currently I implemented a callback as we agreed yesterday, but I don't think it's a good long term solution.
      
      This reduces db_impl.cc from 5000+ LOC to 3400!
      
      Test Plan: make check, will add CompactionJob-specific tests, probably also move some tests from db_test to compaction_job_test
      
      Reviewers: rven, yhchiang, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D27957
      74eb4fbe
  16. 30 10月, 2014 1 次提交
    • I
      WalManager · 63590548
      Igor Canadi 提交于
      Summary: Decoupling code that deals with archived log files outside of DBImpl. That will make this code easier to reason about and test. It will also make the code easier to improve, because an improver doesn't have to understand DBImpl code in entirety.
      
      Test Plan: added test
      
      Reviewers: ljin, yhchiang, rven, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D27873
      63590548
  17. 29 10月, 2014 1 次提交
    • I
      FlushProcess · a39e931e
      Igor Canadi 提交于
      Summary:
      Abstract out FlushProcess and take it out of DBImpl.
      This also includes taking DeletionState outside of DBImpl.
      
      Currently this diff is only doing the refactoring. Future work includes:
      1. Decoupling flush_process.cc, make it depend on less state
      2. Write flush_process_test, which will mock out everything that FlushProcess depends on and test it in isolation
      
      Test Plan: make check
      
      Reviewers: rven, yhchiang, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D27561
      a39e931e
  18. 28 10月, 2014 1 次提交
    • I
      Deprecate AtomicPointer · 48842ab3
      Igor Canadi 提交于
      Summary: RocksDB already depends on C++11, so we might as well all the goodness that C++11 provides. This means that we don't need AtomicPointer anymore. The less things in port/, the easier it will be to port to other platforms.
      
      Test Plan: make check + careful visual review verifying that NoBarried got memory_order_relaxed, while Acquire/Release methods got memory_order_acquire and memory_order_release
      
      Reviewers: rven, yhchiang, ljin, sdong
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D27543
      48842ab3
  19. 17 10月, 2014 1 次提交
  20. 14 10月, 2014 1 次提交
  21. 03 10月, 2014 1 次提交
  22. 02 10月, 2014 1 次提交
    • L
      make compaction related options changeable · 5ec53f3e
      Lei Jin 提交于
      Summary:
      make compaction related options changeable. Most of changes are tedious,
      following the same convention: grabs MutableCFOptions at the beginning
      of compaction under mutex, then pass it throughout the job and register
      it in SuperVersion at the end.
      
      Test Plan: make all check
      
      Reviewers: igor, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23349
      5ec53f3e
  23. 26 9月, 2014 1 次提交
    • L
      CompactedDBImpl · 3c680061
      Lei Jin 提交于
      Summary:
      Add a CompactedDBImpl that will enabled when calling OpenForReadOnly()
      and the DB only has one level (>0) of files. As a performan comparison,
      CuckooTable performs 2.1M/s with CompactedDBImpl vs. 1.78M/s with
      ReadOnlyDBImpl.
      
      Test Plan: db_bench
      
      Reviewers: yhchiang, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23553
      3c680061
  24. 18 9月, 2014 1 次提交
  25. 13 9月, 2014 1 次提交
    • I
      WriteThread · dee91c25
      Igor Canadi 提交于
      Summary: This diff just moves the write thread control out of the DBImpl. I will need this as I will control column family data concurrency by only accessing some data in the write thread. That way, we won't have to lock our accesses to column family hash table (mappings from IDs to CFDs).
      
      Test Plan: make check
      
      Reviewers: sdong, yhchiang, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23301
      dee91c25
  26. 11 9月, 2014 1 次提交
    • I
      Push model for flushing memtables · 3d9e6f77
      Igor Canadi 提交于
      Summary:
      When memtable is full it calls the registered callback. That callback then registers column family as needing the flush. Every write checks if there are some column families that need to be flushed. This completely eliminates the need for MakeRoomForWrite() function and simplifies our Write code-path.
      
      There is some complexity with the concurrency when the column family is dropped. I made it a bit less complex by dropping the column family from the write thread in https://reviews.facebook.net/D22965. Let me know if you want to discuss this.
      
      Test Plan: make check works. I'll also run db_stress with creating and dropping column families for a while.
      
      Reviewers: yhchiang, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D23067
      3d9e6f77
  27. 10 9月, 2014 1 次提交
  28. 09 9月, 2014 2 次提交
    • L
      remove TailingIterator reference in db_impl.h · 171d4ff4
      Lei Jin 提交于
      Summary: as title
      
      Test Plan: make release
      
      Reviewers: igor
      
      Differential Revision: https://reviews.facebook.net/D23073
      171d4ff4
    • I
      Push- instead of pull-model for managing Write stalls · a2bb7c3c
      Igor Canadi 提交于
      Summary:
      Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
      * proceed with all writes without delay
      * delay all writes by fixed time
      * stop all writes
      
      The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).
      
      When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.
      
      This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.
      
      Test Plan: make check for now. I'll add some unit tests later. Also, perf test.
      
      Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D22791
      a2bb7c3c
  29. 06 9月, 2014 2 次提交
  30. 05 9月, 2014 2 次提交
    • S
      Remove path with arena==nullptr from NewInternalIterator · 45a5e3ed
      Stanislau Hlebik 提交于
      Summary:
      Simply code by removing code path which does not use Arena
      from NewInternalIterator
      
      Test Plan:
      make all check
      make valgrind_check
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D22395
      45a5e3ed
    • L
      introduce ImmutableOptions · 5665e5e2
      Lei Jin 提交于
      Summary:
      As a preparation to support updating some options dynamically, I'd like
      to first introduce ImmutableOptions, which is a subset of Options that
      cannot be changed during the course of a DB lifetime without restart.
      
      ColumnFamily will keep both Options and ImmutableOptions. Any component
      below ColumnFamily should only take ImmutableOptions in their
      constructor. Other options should be taken from APIs, which will be
      allowed to adjust dynamically.
      
      I am yet to make changes to memtable and other related classes to take
      ImmutableOptions in their ctor. That can be done in a seprate diff as
      this one is already pretty big.
      
      Test Plan: make all check
      
      Reviewers: yhchiang, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D22545
      5665e5e2
  31. 27 8月, 2014 1 次提交
  32. 26 8月, 2014 1 次提交