1. 18 3月, 2016 1 次提交
    • M
      Adding pin_l0_filter_and_index_blocks_in_cache feature. · 522de4f5
      Marton Trencseni 提交于
      Summary:
      When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
      What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
      When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.
      
      Test Plan:
      Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).
      
      DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
        Mac: OK.
        Linux: with D55287 patched in it's OK.
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D54801
      522de4f5
  2. 11 3月, 2016 1 次提交
    • Y
      Cache to have an option to fail Cache::Insert() when full · f71fc77b
      Yi Wu 提交于
      Summary:
      Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error.
      
      I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed.
      
      Test Plan: make check -j32, see tests pass.
      
      Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D54705
      f71fc77b
  3. 10 2月, 2016 2 次提交
  4. 24 12月, 2015 1 次提交
    • A
      Skip bottom-level filter block caching when hit-optimized · e089db40
      Andrew Kryczka 提交于
      Summary:
      When Get() or NewIterator() trigger file loads, skip caching the filter block if
      (1) optimize_filters_for_hits is set and (2) the file is on the bottommost
      level. Also skip checking filters under the same conditions, which means that
      for a preloaded file or a file that was trivially-moved to the bottom level, its
      filter block will eventually expire from the cache.
      
      - added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
      - in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
      - in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
      
      Test Plan:
      updated unit test:
      
        $ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
      
      will also run 'make check'
      
      Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D51633
      e089db40
  5. 12 12月, 2015 1 次提交
    • A
      Use SST files for Transaction conflict detection · 3bfd3d39
      agiardullo 提交于
      Summary:
      Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
      
      With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
      
      Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
      
      Test Plan: unit tests, db bench
      
      Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D50475
      3bfd3d39
  6. 14 10月, 2015 1 次提交
    • S
      Seperate InternalIterator from Iterator · 35ad531b
      sdong 提交于
      Summary:
      Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
      
      This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
      At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
      
      Test Plan: Run all existing tests.
      
      Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
      
      Reviewed By: rven
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D48549
      35ad531b
  7. 12 9月, 2015 1 次提交
  8. 03 9月, 2015 1 次提交
  9. 27 8月, 2015 2 次提交
    • I
      ReadaheadRandomAccessFile -- userspace readahead · 5f4166c9
      Igor Canadi 提交于
      Summary:
      ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.
      
      We add ReadaheadRandomAccessFile layer only when file is read during compactions.
      
      D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.
      
      Test Plan: make check
      
      Reviewers: MarkCallaghan, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D45123
      5f4166c9
    • A
      Fix DBTest.GetProperty · 3795449c
      Andres Noetzli 提交于
      Summary:
      DBTest.GetProperty was failing occasionally (see task #8131266). The reason was
      that the test closed the database before the compaction was done. When the test
      reopened the database, RocksDB would schedule a compaction which in turn
      created table readers and lead the test to fail the assertion that
      rocksdb.estimate-table-readers-mem is 0. In most cases, GetIntProperty() of
      rocksdb.estimate-table-readers-mem happened before the compaction created the
      table readers, hiding the problem. This patch changes the
      WaitForFlushMemTable() to WaitForCompact(). WaitForFlushMemTable() is not
      necessary because it is already being called a couple of lines before without
      any insertions in-between.
      
      Test Plan:
      Insert `usleep(10000);` just after `Reopen(options);` on line 2333 to make the issue more likely, then run:
      make db_test && while ./db_test --gtest_filter=DBTest.GetProperty; do true; done
      
      Reviewers: rven, yhchiang, anthony, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D45603
      3795449c
  10. 21 8月, 2015 1 次提交
    • S
      Add options.new_table_reader_for_compaction_inputs · 9130873a
      sdong 提交于
      Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.
      
      Test Plan: Add the option.
      
      Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang
      
      Reviewed By: igor
      
      Subscribers: igor, MarkCallaghan, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D43311
      9130873a
  11. 15 8月, 2015 1 次提交
    • S
      Measure file read latency histogram per level · 72613657
      sdong 提交于
      Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
      
      Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
      
      Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: MarkCallaghan, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D44193
      72613657
  12. 06 8月, 2015 1 次提交
    • S
      Add statistic histogram "rocksdb.sst.read.micros" · 3ae386ea
      sdong 提交于
      Summary: Measure read latency histogram and put in statistics. Compaction inputs are excluded from it when possible (unfortunately usually no possible as we usually take table reader from table cache.
      
      Test Plan:
      Run db_bench and it shows the stats, like:
      
      rocksdb.sst.read.micros statistics Percentiles :=> 50 : 1.238522 95 : 2.529740 99 : 3.912180
      
      Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, MarkCallaghan, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D43275
      3ae386ea
  13. 18 7月, 2015 1 次提交
    • S
      Move rate_limiter, write buffering, most perf context instrumentation and most... · 6e9fbeb2
      sdong 提交于
      Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
      
      Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
      
      Test Plan: Run all existing unit tests.
      
      Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D42321
      6e9fbeb2
  14. 11 7月, 2015 1 次提交
  15. 24 6月, 2015 1 次提交
    • G
      Implement a table-level row cache · 782a1590
      Giuseppe Ottaviano 提交于
      Summary:
      Implementation of a table-level row cache.
      It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.
      
      Supports snapshots and merge operations.
      
      Test Plan: Ran `make valgrind_check commit-prereq`
      
      Reviewers: igor, philipp, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D39849
      782a1590
  16. 29 10月, 2014 1 次提交
    • I
      TableMock + framework for mock classes · abac3d64
      Igor Canadi 提交于
      Summary:
      This diff replaces BlockBasedTable in flush_job_test with TableMock, making it depend on less things and making it closer to an unit test than integration test.
      
      It also introduces a framework to compile mock classes -- Any file named *mock.cc will not be compiled into the build. It will only get compiled into the tests. What way we can mock out most other classes, Version, VersionSet, DBImpl, etc.
      
      Test Plan: flush_job_test
      
      Reviewers: ljin, rven, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D27681
      abac3d64
  17. 30 9月, 2014 1 次提交
    • L
      use GetContext to replace callback function pointer · 2faf49d5
      Lei Jin 提交于
      Summary:
      Intead of passing callback function pointer and its arg on Table::Get()
      interface, passing GetContext. This makes the interface cleaner and
      possible better perf. Also adding a fast pass for SaveValue()
      
      Test Plan: make all check
      
      Reviewers: igor, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D24057
      2faf49d5
  18. 05 9月, 2014 1 次提交
    • L
      introduce ImmutableOptions · 5665e5e2
      Lei Jin 提交于
      Summary:
      As a preparation to support updating some options dynamically, I'd like
      to first introduce ImmutableOptions, which is a subset of Options that
      cannot be changed during the course of a DB lifetime without restart.
      
      ColumnFamily will keep both Options and ImmutableOptions. Any component
      below ColumnFamily should only take ImmutableOptions in their
      constructor. Other options should be taken from APIs, which will be
      allowed to adjust dynamically.
      
      I am yet to make changes to memtable and other related classes to take
      ImmutableOptions in their ctor. That can be done in a seprate diff as
      this one is already pretty big.
      
      Test Plan: make all check
      
      Reviewers: yhchiang, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D22545
      5665e5e2
  19. 07 8月, 2014 1 次提交
    • S
      Add DB property "rocksdb.estimate-table-readers-mem" · 1242bfca
      sdong 提交于
      Summary:
      Add a DB Property "rocksdb.estimate-table-readers-mem" to return estimated memory usage by all loaded table readers, other than allocated from block cache.
      
      Refactor the property codes to allow getting property from a version, with DB mutex not acquired.
      
      Test Plan: Add several checks of this new property in existing codes for various cases.
      
      Reviewers: yhchiang, ljin
      
      Reviewed By: ljin
      
      Subscribers: xjin, igor, leveldb
      
      Differential Revision: https://reviews.facebook.net/D20733
      1242bfca
  20. 03 7月, 2014 1 次提交
  21. 20 6月, 2014 1 次提交
    • I
      Remove seek compaction · d4a84233
      Igor Canadi 提交于
      Summary:
      As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
      
      This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
      
      There is one test case that relied on didIO information. I'll try to find another way to implement it.
      
      Test Plan: make check
      
      Reviewers: sdong, haobo, yhchiang, ljin, dhruba
      
      Reviewed By: ljin
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D19161
      d4a84233
  22. 17 6月, 2014 1 次提交
    • S
      Refactor: group metadata needed to open an SST file to a separate copyable struct · cadc1adf
      sdong 提交于
      Summary:
      We added multiple fields to FileMetaData recently and are planning to add more.
      This refactoring separate the minimum information for accessing the file. This object is copyable (FileMetaData is not copyable since the ref counter). I hope this refactoring can enable further improvements:
      
      (1) use it to design a more efficient data structure to speed up read queries.
      (2) in the future, when we add information of storage level, we can easily do the encoding, instead of enlarge this structure, which might expand memory work set for file meta data.
      
      The definition is same as current EncodedFileMetaData used in two level iterator, so now the logic in two level iterator is easier to understand.
      
      Test Plan: make all check
      
      Reviewers: haobo, igor, ljin
      
      Reviewed By: ljin
      
      Subscribers: leveldb, dhruba, yhchiang
      
      Differential Revision: https://reviews.facebook.net/D18933
      cadc1adf
  23. 03 6月, 2014 1 次提交
    • S
      In DB::NewIterator(), try to allocate the whole iterator tree in an arena · df9069d2
      sdong 提交于
      Summary:
      In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
      1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
      2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
      3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
      
      Limitations:
      (1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
      (2) Two level iterator itself is allocated in arena, but not iterators inside it.
      
      Test Plan: make all check
      
      Reviewers: ljin, haobo
      
      Reviewed By: haobo
      
      Subscribers: leveldb, dhruba, yhchiang, igor
      
      Differential Revision: https://reviews.facebook.net/D18513
      df9069d2
  24. 26 4月, 2014 3 次提交
  25. 18 4月, 2014 2 次提交
    • S
      Fix bugs introduced by D17961 · 65179225
      sdong 提交于
      Summary:
      D17961 has two bugs:
      (1) two level iterator fails to populate FileMetaData.table_reader, causing performance regression.
      (2) table cache handle the !status.ok() case in the wrong place, causing seg fault which shouldn't happen.
      
      Test Plan: make all check
      
      Reviewers: ljin, igor, haobo
      
      Reviewed By: ljin
      
      CC: yhchiang, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D17991
      65179225
    • S
      Minimize accessing multiple objects in Version::Get() · fa430bfd
      sdong 提交于
      Summary:
      One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
      (1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects
      (2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it
      (3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses
      
      Test Plan: make all check
      
      Reviewers: haobo, ljin
      
      Reviewed By: haobo
      
      CC: igor, yhchiang, nkg-, leveldb
      
      Differential Revision: https://reviews.facebook.net/D17739
      fa430bfd
  26. 26 3月, 2014 1 次提交
    • S
      When Options.max_num_files=-1, non level0 files also by pass table cache · 6b2e7a2a
      sdong 提交于
      Summary:
      This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done.
      
      Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes.
      
      Reviewers: haobo, igor, dhruba, ljin, yhchiang
      
      Reviewed By: haobo
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D17001
      6b2e7a2a
  27. 14 2月, 2014 1 次提交
  28. 06 2月, 2014 2 次提交
    • I
      [CF] Add full_options_ to ColumnFamilyData · 6e56ab57
      Igor Canadi 提交于
      Summary:
      Lots of code expects Options on construction/function call. My original idea was to split Options argument into ColumnFamilyOptions and DBOptions (the latter only if needed). However, this will require huge code changes very deep in the stack.
      
      The better idea is to have ColumnFamilyData hold both ColumnFamilyOptions and Options. ColumnFamilyData::Options would be constructed from DBOptions (same for each column family) and ColumnFamilyOptions (different for each column family)
      
      Now when we construct a class or call any method that requires Options, we can just push him ColumnFamilyData::Options and be sure that it's using column-family-specific settings.
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, kailiu, sdong
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15927
      6e56ab57
    • I
      [CF] Rethink table cache · c24d8c4e
      Igor Canadi 提交于
      Summary:
      Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
      
      To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
      
      This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
      
      Test Plan: make check
      
      Reviewers: dhruba, haobo, kailiu, sdong
      
      CC: leveldb
      
      Differential Revision: https://reviews.facebook.net/D15915
      c24d8c4e
  29. 04 2月, 2014 1 次提交
  30. 03 2月, 2014 1 次提交
  31. 11 1月, 2014 1 次提交
    • S
      [Performance Branch] If options.max_open_files set to be -1, cache table... · aa0ef660
      Siying Dong 提交于
      [Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator()
      
      Summary:
      In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().
      
      We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.
      
      Test Plan: add a test case in db_test
      
      Reviewers: haobo, kailiu
      
      Reviewed By: haobo
      
      CC: dhruba, igor, leveldb
      
      Differential Revision: https://reviews.facebook.net/D15039
      aa0ef660
  32. 03 1月, 2014 1 次提交
  33. 27 12月, 2013 1 次提交
  34. 26 11月, 2013 1 次提交