1. 18 Mar 2021 (1 commit)
    • Use SST file manager to track blob files as well (#8037) · 27d57a03
      Committed by Akanksha Mahajan
      Summary:
      Extend support to track blob files in the SST file manager.
      This PR notifies the SstFileManager whenever a new blob file is created (via
      OnAddFile), whenever an obsolete blob file is deleted (via OnDeleteFile), and
      schedules blob file deletions via ScheduleFileDeletion.
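      As a minimal sketch of how a user wires these pieces together (the DB path is illustrative; the option and factory names are the public RocksDB API):

      ```cpp
      #include <memory>

      #include "rocksdb/db.h"
      #include "rocksdb/sst_file_manager.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.enable_blob_files = true;  // large values are written to blob files
        // With an SstFileManager attached, blob file creations/deletions are now
        // reported to it (OnAddFile / OnDeleteFile / ScheduleFileDeletion) as well.
        options.sst_file_manager.reset(
            rocksdb::NewSstFileManager(rocksdb::Env::Default()));

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blob_sfm_example", &db);
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```
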
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037
      
      Test Plan: Add new unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D26891237
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
  2. 13 Nov 2020 (1 commit)
    • Add full_history_ts_low_ to FlushJob (#7655) · 76ef894f
      Committed by Yanqin Jin
      Summary:
      https://github.com/facebook/rocksdb/issues/7556 enables `CompactionIterator` to perform garbage collection during compaction according
      to a lower bound (user-defined) timestamp `full_history_ts_low_`.
      This PR adds a `std::string` data member `full_history_ts_low_` to `FlushJob`; it does
      not change during flush. `FlushJob` passes a pointer to this data member to the
      `CompactionIterator` used during flush.
      
      Also refactored flush_job_test.cc to re-use some existing code, which is actually the majority of this PR.
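      A conceptual sketch of the garbage-collection rule this lower bound enables (not the actual `CompactionIterator` code; the function and arguments are illustrative):

      ```cpp
      #include <string>

      // A key version can be dropped during flush/compaction once it is shadowed by
      // a newer version and its timestamp is below full_history_ts_low, because no
      // reader will ask for history older than that bound. Timestamps are assumed
      // to compare bytewise, with older timestamps comparing smaller.
      bool CanGarbageCollect(const std::string& ts, bool has_newer_version,
                             const std::string& full_history_ts_low) {
        return has_newer_version && ts < full_history_ts_low;
      }
      ```
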
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7655
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D24933340
      
      Pulled By: riversand963
      
      fbshipit-source-id: 2e584bfd0cf6e5c295ab1af264e68e9d6a12fca3
  3. 15 Sep 2020 (1 commit)
    • Integrate blob file writing with the flush logic (#7345) · b0e78341
      Committed by Levi Tamasi
      Summary:
      The patch adds support for writing blob files during flush by integrating
      `BlobFileBuilder` with the flush logic, most importantly, `BuildTable` and
      `CompactionIterator`. If `enable_blob_files` is set, large values are extracted
      to blob files and replaced with references. The resulting blob files are then
      logged to the MANIFEST as part of the flush job's `VersionEdit` and
      added to the `Version`, similarly to table files. Errors related to writing
      blob files fail the flush, and any blob files written by such jobs are immediately
      deleted (again, similarly to how SST files are handled). In addition, the patch
      extends the logging and statistics around flushes to account for the presence
      of blob files (e.g. `InternalStats::CompactionStats::bytes_written`, which is
      used for calculating write amplification, now considers the blob files as well).
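      A hedged configuration sketch for turning this on (the threshold and sizes are illustrative; the option names are the public integrated-BlobDB options):

      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeBlobOptions() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.enable_blob_files = true;            // extract large values during flush
        options.min_blob_size = 4096;                // values >= 4 KB go to blob files
        options.blob_file_size = 256 * 1024 * 1024;  // target blob file size
        options.blob_compression_type = rocksdb::kLZ4Compression;
        return options;
      }
      ```
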
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7345
      
      Test Plan: Tested using `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D23506369
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 646885f22dfbe063f650d38a1fedc132f499a159
  4. 09 Sep 2020 (1 commit)
    • Store FSWritableFilePtr object in WritableFileWriter (#7193) · b175eceb
      Committed by Akanksha Mahajan
      Summary:
      Replace the raw FSWritableFile pointer in WritableFileWriter with an
      FSWritableFilePtr object that wraps the underlying FSWritableFile pointer.

      Objective: if tracing is enabled, FSWritableFilePtr returns an
      FSWritableFileTracingWrapper pointer that captures the necessary information
      in an IORecord, calls the underlying FileSystem, and invokes the IOTracer to
      dump that record to a binary file. If tracing is disabled, the underlying
      FSWritableFile pointer is returned directly. The FSWritableFilePtr wrapper
      class is added so that the tracing wrapper is bypassed when tracing is
      disabled.
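      An illustrative sketch of the pointer-wrapper pattern described above; the class names here are hypothetical stand-ins, not the actual FSWritableFilePtr/IOTracer code:

      ```cpp
      #include <memory>
      #include <string>

      class WritableFileLike {
       public:
        virtual ~WritableFileLike() = default;
        virtual void Append(const std::string& data) = 0;
      };

      // Decorator that records an IO record and forwards to the real file.
      class TracingFile : public WritableFileLike {
       public:
        explicit TracingFile(WritableFileLike* target) : target_(target) {}
        void Append(const std::string& data) override {
          // Record offset/size/timestamp here, then forward to the wrapped file.
          target_->Append(data);
        }

       private:
        WritableFileLike* target_;
      };

      // Pointer-like wrapper: hands out the tracing decorator only when tracing is
      // enabled, so the common (non-tracing) path goes straight to the file.
      class FilePtr {
       public:
        FilePtr(std::shared_ptr<WritableFileLike> file, bool tracing_enabled)
            : file_(std::move(file)) {
          if (tracing_enabled) tracer_ = std::make_unique<TracingFile>(file_.get());
        }
        WritableFileLike* operator->() const {
          return tracer_ ? static_cast<WritableFileLike*>(tracer_.get()) : file_.get();
        }

       private:
        std::shared_ptr<WritableFileLike> file_;
        std::unique_ptr<TracingFile> tracer_;
      };
      ```
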
      
          Test Plan: make check -j64
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7193
      
      Reviewed By: anand1976
      
      Differential Revision: D23355915
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e62a27a13c1fd77e36a6dbafc7006d969bed25cf
  5. 18 Jun 2020 (1 commit)
    • Store DB identity and DB session ID in SST files (#6983) · 94d04529
      Committed by Zitan Chen
      Summary:
      `db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.
      
      The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.
      
      In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.
      
      A table property test is added.
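      A small usage sketch for reading these identifiers back (the helper name is illustrative; the accessors are the public TableProperties fields added by this change):

      ```cpp
      #include <iostream>

      #include "rocksdb/db.h"
      #include "rocksdb/table_properties.h"

      void PrintCreatorIdentity(rocksdb::DB* db) {
        rocksdb::TablePropertiesCollection props;
        if (!db->GetPropertiesOfAllTables(&props).ok()) return;
        for (const auto& file_and_props : props) {
          // db_id / db_session_id identify the DB instance and session that
          // created each SST file.
          std::cout << file_and_props.first
                    << " db_id=" << file_and_props.second->db_id
                    << " db_session_id=" << file_and_props.second->db_session_id << "\n";
        }
      }
      ```
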
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983
      
      Test Plan: make check and some manual tests.
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D22048826
      
      Pulled By: gg814
      
      fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
  6. 28 Mar 2020 (1 commit)
    • Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487) · 42468881
      Committed by Zhichao Cao
      Summary:
      In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
      
      The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and compaction). The second job is to identify a retryable IO error as a HardError and set bg_error_ to HardError. In this case, the DB instance becomes read-only. The user is informed via the Status and needs to take action to deal with it (e.g., call db->Resume()).
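      A hedged sketch of the recovery flow this enables (the helper name and the severity check are illustrative; Status::severity() and DB::Resume() are the public API):

      ```cpp
      #include <string>

      #include "rocksdb/db.h"

      rocksdb::Status WriteWithRecovery(rocksdb::DB* db, const std::string& key,
                                        const std::string& value) {
        rocksdb::Status s = db->Put(rocksdb::WriteOptions(), key, value);
        if (!s.ok() && s.severity() >= rocksdb::Status::Severity::kHardError) {
          // A retryable IO error in a background job put the DB into read-only
          // mode. Once the underlying storage problem is addressed, Resume()
          // clears the background error and re-enables writes.
          if (db->Resume().ok()) {
            s = db->Put(rocksdb::WriteOptions(), key, value);
          }
        }
        return s;
      }
      ```
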
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
      
      Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
      
      Reviewed By: anand1976
      
      Differential Revision: D20685017
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
  7. 21 Feb 2020 (1 commit)
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      Committed by sdong
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
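      A hedged build/usage sketch: if RocksDB is compiled with, e.g., `-DROCKSDB_NAMESPACE=my_rocksdb` (the name is illustrative), client code that refers to the namespace through the macro keeps compiling unchanged:

      ```cpp
      #include "rocksdb/db.h"  // pulls in rocksdb_namespace.h, which defaults the macro to rocksdb

      int main() {
        ROCKSDB_NAMESPACE::Options options;
        options.create_if_missing = true;
        ROCKSDB_NAMESPACE::DB* db = nullptr;
        ROCKSDB_NAMESPACE::Status s =
            ROCKSDB_NAMESPACE::DB::Open(options, "/tmp/ns_example", &db);
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```
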
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE overridden to another value.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  8. 14 Dec 2019 (1 commit)
    • Introduce a new storage specific Env API (#5761) · afa2420c
      Committed by anand76
      Summary:
      The current Env API encompasses both storage/file operations and OS-related operations. Most of the APIs return a Status, which does not carry enough metadata about an error, such as whether it is retryable or not, the scope (i.e. fault domain) of the error, etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy, etc.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
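      A minimal sketch of plugging into the new interface by wrapping the default `FileSystem` (the class name and counter are illustrative; `FileSystemWrapper` forwards all other calls to its target). Wiring it into a DB then follows the summary above, e.g. via the composite Env:

      ```cpp
      #include <memory>
      #include <string>

      #include "rocksdb/file_system.h"

      class CountingFileSystem : public rocksdb::FileSystemWrapper {
       public:
        explicit CountingFileSystem(const std::shared_ptr<rocksdb::FileSystem>& t)
            : rocksdb::FileSystemWrapper(t) {}
        const char* Name() const override { return "CountingFileSystem"; }

        // Storage calls now return IOStatus and take FileOptions/IOOptions,
        // giving the caller error metadata and more control over the IO.
        rocksdb::IOStatus NewWritableFile(
            const std::string& fname, const rocksdb::FileOptions& opts,
            std::unique_ptr<rocksdb::FSWritableFile>* result,
            rocksdb::IODebugContext* dbg) override {
          ++writable_files_created_;
          return target()->NewWritableFile(fname, opts, result, dbg);
        }

       private:
        int writable_files_created_ = 0;
      };
      ```
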
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
  9. 01 Jun 2019 (1 commit)
  10. 11 Apr 2019 (1 commit)
    • Periodic Compactions (#5166) · d3d20dcd
      Committed by Sagar Vemuri
      Summary:
      Introducing Periodic Compactions.
      
      This feature allows all the files in a CF to be periodically compacted. It can help proactively catch any corruption that creeps into the DB, as every file is constantly getting re-compacted. And, of course, it also helps clean up data older than a certain threshold.
      
      - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
      - This works across all levels.
      - The files are put in the same level after going through the compaction. (Related files in the same level are picked up together, as `ExpandInputsToCleanCut` is used.)
      - Compaction filters, if any, are invoked as usual.
      - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
      
      This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on levels 0 through the last-but-one level, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time`, all files in the levels above the bottom keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
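      A configuration sketch of the layered setup described above (values are illustrative; in released versions the option is spelled `periodic_compaction_seconds`):

      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakePeriodicCompactionOptions() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.ttl = 1 * 24 * 60 * 60;                          // 1 day
        options.periodic_compaction_seconds = 7 * 24 * 60 * 60;  // 7 days
        return options;
      }
      ```
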
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
      
      Differential Revision: D14884441
      
      Pulled By: sagar0
      
      fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
  11. 19 Mar 2019 (1 commit)
    • Feature for sampling and reporting compressibility (#4842) · b45b1cde
      Committed by Shobhit Dayal
      Summary:
      This is a feature to sample data-block compressibility and report it as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
      1. lz4 or snappy for fast compression
      2. zstd or zlib for slow but higher compression.
      
      The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.
      
      The db_bench tool now has a command-line option for specifying the sampling rate. Its default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench while varying the sampling rate. The impact is unlikely to be noticeable for high values like 20.
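      A hedged configuration sketch; the column family option is assumed here to be spelled `sample_for_compression`, matching the db_bench flag described above:

      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeSamplingOptions() {
        rocksdb::Options options;
        // Sample 1 in 20 data blocks; 0 (the default) disables sampling.
        options.sample_for_compression = 20;
        return options;
      }
      ```
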
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842
      
      Differential Revision: D13629011
      
      Pulled By: shobhitdayal
      
      fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
  12. 15 Feb 2019 (1 commit)
    • Dictionary compression for files written by SstFileWriter (#4978) · c8c8104d
      Committed by Andrew Kryczka
      Summary:
      If `CompressionOptions::max_dict_bytes` and/or `CompressionOptions::zstd_max_train_bytes` are set, `SstFileWriter` will now generate files respecting those options.
      
      I refactored the logic a bit for deciding when to use dictionary compression. Previously we plumbed `is_bottommost_level` down to the table builder and used that. However it was kind of confusing in `SstFileWriter`'s context since we don't know what level the file will be ingested to. Instead, now the higher-level callers (e.g., flush, compaction, file writer) are responsible for building the right `CompressionOptions` to give the table builder.
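      A usage sketch of an SstFileWriter honoring the dictionary settings (the path, key, and value are illustrative):

      ```cpp
      #include <string>

      #include "rocksdb/options.h"
      #include "rocksdb/sst_file_writer.h"

      rocksdb::Status WriteFileWithDictionary(const std::string& path) {
        rocksdb::Options options;
        options.compression = rocksdb::kZSTD;
        options.compression_opts.max_dict_bytes = 16 * 1024;
        options.compression_opts.zstd_max_train_bytes = 100 * 16 * 1024;

        rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
        rocksdb::Status s = writer.Open(path);
        if (!s.ok()) return s;
        s = writer.Put("key1", "a large, compressible value ...");
        if (!s.ok()) return s;
        return writer.Finish();  // the file now carries its own compression dictionary
      }
      ```
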
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4978
      
      Differential Revision: D14060763
      
      Pulled By: ajkr
      
      fbshipit-source-id: dc802c327896df2b319dc162d6acc82b9cdb452a
  13. 12 Feb 2019 (1 commit)
    • Reduce scope of compression dictionary to single SST (#4952) · 62f70f6d
      Committed by Andrew Kryczka
      Summary:
      Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
      
      So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
      
      - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
      - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
      - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
      
      Differential Revision: D13967980
      
      Pulled By: ajkr
      
      fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
  14. 18 Dec 2018 (1 commit)
  15. 10 Aug 2018 (1 commit)
    • Index value delta encoding (#3983) · caf0f53a
      Committed by Maysam Yabandeh
      Summary:
      Given that an index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding to the values. The first value at each index restart interval encodes the full BlockHandle, but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the details of the encoding. This reduces the index size, which helps use the block cache more efficiently. The feature is enabled by setting format_version to 4.
      
      The feature comes with a bit of CPU overhead, which should be paid back by higher cache hit rates due to the smaller index block size.
      Results with sysbench read-only, using 4k blocks and an index restart interval of 16:
      Format 2:
      19585   rocksdb read-only range=100
      Format 3:
      19569   rocksdb read-only range=100
      Format 4:
      19352   rocksdb read-only range=100
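      Enabling the feature looks like the following sketch (the restart interval is chosen to match the benchmark above):

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      rocksdb::Options MakeFormat4Options() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.format_version = 4;  // delta-encoded index values
        table_options.index_block_restart_interval = 16;
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```
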
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
      
      Differential Revision: D8361343
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
  16. 22 May 2018 (1 commit)
    • Move prefix_extractor to MutableCFOptions · c3ebc758
      Committed by Zhongyi Xie
      Summary:
      Currently it is not possible to change the bloom filter config without restarting the db, which causes a lot of operational complexity for users.
      This PR aims to make it possible to dynamically change bloom filter config.
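      A hedged sketch of the dynamic change this enables; the option-string spelling for the built-in fixed-prefix extractor is assumed to be `rocksdb.FixedPrefix.8`:

      ```cpp
      #include <string>
      #include <unordered_map>

      #include "rocksdb/db.h"

      rocksdb::Status ChangePrefixExtractor(rocksdb::DB* db) {
        // Swap the prefix extractor (and hence the effective prefix bloom config)
        // at runtime, without reopening the DB.
        return db->SetOptions(db->DefaultColumnFamily(),
                              {{"prefix_extractor", "rocksdb.FixedPrefix.8"}});
      }
      ```
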
      Closes https://github.com/facebook/rocksdb/pull/3601
      
      Differential Revision: D7253114
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
  17. 11 Nov 2017 (1 commit)
  18. 28 Oct 2017 (1 commit)
    • TableProperty::oldest_key_time defaults to 0 · 84a04af9
      Committed by Yi Wu
      Summary:
      We don't propagate TableProperty::oldest_key_time on compaction and just write the default value to SST files. It is more natural to default the value to 0.
      
      Also revert db_sst_test back to before #2842.
      Closes https://github.com/facebook/rocksdb/pull/3079
      
      Differential Revision: D6165702
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ca3ce5928d96ae79a5beb12bb7d8c640a71478a0
  19. 24 Oct 2017 (1 commit)
    • Add DB::Properties::kEstimateOldestKeyTime · 66a2c44e
      Committed by Yi Wu
      Summary:
      With FIFO compaction we would like to get the age of the oldest data for monitoring. The problem is that we don't have a timestamp for each key in the DB. As an approximation, we expose the earliest "creation_time" property among the SST files.
      
      My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
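      Reading the new property looks like this sketch (the helper name is illustrative):

      ```cpp
      #include <cstdint>

      #include "rocksdb/db.h"

      uint64_t EstimateOldestKeyTime(rocksdb::DB* db) {
        uint64_t value = 0;
        // Returns false (leaving value at 0) if the property is unavailable,
        // e.g. when the compaction style is not FIFO.
        db->GetIntProperty(rocksdb::DB::Properties::kEstimateOldestKeyTime, &value);
        return value;
      }
      ```
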
      Closes https://github.com/facebook/rocksdb/pull/2842
      
      Differential Revision: D5770600
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
  20. 07 Oct 2017 (1 commit)
    • WritePrepared Txn: Compaction/Flush · d1b74b0c
      Committed by Yi Wu
      Summary:
      Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator:
      * check if key has been committed. If not, output uncommitted keys AS-IS.
      * use SnapshotChecker to check if key is visible to a snapshot when in need.
      * do not output key with seq = 0 if the key is not committed.
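      For context, a minimal sketch of opening a DB in the mode that exercises these changes (the path handling is illustrative; the options are the public TransactionDB API):

      ```cpp
      #include <string>

      #include "rocksdb/utilities/transaction_db.h"

      rocksdb::Status OpenWritePrepared(const std::string& path,
                                        rocksdb::TransactionDB** txn_db) {
        rocksdb::Options options;
        options.create_if_missing = true;
        rocksdb::TransactionDBOptions txn_db_options;
        // With WRITE_PREPARED, uncommitted data can reach memtables and SST files,
        // which is why compaction/flush need the SnapshotChecker described above.
        txn_db_options.write_policy = rocksdb::TxnDBWritePolicy::WRITE_PREPARED;
        return rocksdb::TransactionDB::Open(options, txn_db_options, path, txn_db);
      }
      ```
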
      Closes https://github.com/facebook/rocksdb/pull/2926
      
      Differential Revision: D5902907
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
  21. 16 Jul 2017 (1 commit)
  22. 28 Jun 2017 (1 commit)
    • FIFO Compaction with TTL · 1cd45cd1
      Committed by Sagar Vemuri
      Summary:
      Introducing FIFO compactions with TTL.
      
      FIFO compaction is based on size only, which makes it tricky to enable in production, as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.
      
      To address that request:
      - Added a new TTL option to FIFO compaction options.
      - Updated FIFO compaction score to take TTL into consideration.
      - Added a new table property, creation_time, to keep track of when the SST file is created.
      - Creation_time is set as below:
        - On Flush: Set to the time of flush.
        - On Compaction: Set to the max creation_time of all the files involved in the compaction.
        - On Repair and Recovery: Set to the time of repair/recovery.
        - Old files created prior to this code change will have a creation_time of 0.
      - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
      - FIFO compaction will fall back to the prior way of deleting files based on size if:
        - the creation_time of all files involved in compaction is 0.
        - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
      
      This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
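      A hedged configuration sketch for the setup benchmarked below (field names as of this change; later releases expose the TTL as a top-level `ttl` option):

      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeFifoTtlOptions() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.compaction_style = rocksdb::kCompactionStyleFIFO;
        options.compaction_options_fifo.max_table_files_size = 100ull * 1024 * 1024;  // 100 MB
        options.compaction_options_fifo.ttl = 24 * 60 * 60;  // drop files older than 1 day
        options.max_open_files = -1;  // required for FIFO with TTL, per the summary
        return options;
      }
      ```
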
      
      **Test Plan:**
      Added tests.
      
      **Benchmark results:**
      Base: FIFO with max size: 100MB ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
      
      readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
      ```
      
      With TTL (a low one for testing) ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
      
      readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
      ```
      Example Log lines:
      ```
      2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
      2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
      ...
      2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
      2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
      ```
      
      SST Files remaining in the dbbench dir, after db_bench execution completed:
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
      -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
      -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
      -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
      ```
      Closes https://github.com/facebook/rocksdb/pull/2480
      
      Differential Revision: D5305116
      
      Pulled By: sagar0
      
      fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
  23. 28 Apr 2017 (1 commit)
  24. 06 Apr 2017 (1 commit)
  25. 20 Nov 2016 (1 commit)
  26. 01 Nov 2016 (1 commit)
    • DeleteRange flush support · 40a2e406
      Committed by Andrew Kryczka
      Summary:
      Changed BuildTable() (used for flush) to (1) add range
      tombstones to the aggregator, which is used by CompactionIterator to
      determine which keys can be removed; and (2) add aggregator's range
      tombstones to the table that is output for the flush.
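      A usage sketch of the path this change covers (the key range is illustrative): a range tombstone written with DeleteRange is carried through flush into the resulting SST file.

      ```cpp
      #include "rocksdb/db.h"

      rocksdb::Status DropKeyRange(rocksdb::DB* db) {
        rocksdb::Status s =
            db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                            "user_000000", "user_999999");
        if (!s.ok()) return s;
        // BuildTable() now adds the aggregator's range tombstones to the flushed table.
        return db->Flush(rocksdb::FlushOptions());
      }
      ```
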
      Closes https://github.com/facebook/rocksdb/pull/1438
      
      Differential Revision: D4100025
      
      Pulled By: ajkr
      
      fbshipit-source-id: cb01a70
  27. 18 Sep 2016 (1 commit)
  28. 03 Sep 2016 (1 commit)
  29. 18 May 2016 (1 commit)
    • [rocksdb] make more options dynamic · 43afd72b
      Committed by Aaron Gao
      Summary:
      make more ColumnFamilyOptions dynamic:
      - compression
      - soft_pending_compaction_bytes_limit
      - hard_pending_compaction_bytes_limit
      - min_partial_merge_operands
      - report_bg_io_stats
      - paranoid_file_checks
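      A usage sketch of changing the newly dynamic options at runtime through SetOptions (the values are illustrative):

      ```cpp
      #include <string>
      #include <unordered_map>

      #include "rocksdb/db.h"

      rocksdb::Status TuneAtRuntime(rocksdb::DB* db) {
        return db->SetOptions(db->DefaultColumnFamily(),
                              {{"compression", "kSnappyCompression"},
                               {"soft_pending_compaction_bytes_limit", "68719476736"},
                               {"paranoid_file_checks", "true"}});
      }
      ```
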
      
      Test Plan:
      Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit.
      All passed.
      
      Reviewers: andrewkr, sdong, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D57519
  30. 30 Apr 2016 (1 commit)
    • Added EventListener::OnTableFileCreationStarted() callback · a92049e3
      Committed by Yi Wu
      Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will now also be called in the failure case; users can check the creation status via TableFileCreationInfo::status.
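      A listener sketch using the new callback pair (the class name and logging are illustrative):

      ```cpp
      #include <iostream>

      #include "rocksdb/listener.h"

      class FileCreationLogger : public rocksdb::EventListener {
       public:
        void OnTableFileCreationStarted(
            const rocksdb::TableFileCreationBriefInfo& info) override {
          std::cout << "creating " << info.file_path << " (job " << info.job_id << ")\n";
        }
        void OnTableFileCreated(const rocksdb::TableFileCreationInfo& info) override {
          // Now also invoked on failure; the outcome is in info.status.
          if (!info.status.ok()) {
            std::cout << "creation failed: " << info.status.ToString() << "\n";
          }
        }
      };
      // Register with: options.listeners.emplace_back(std::make_shared<FileCreationLogger>());
      ```
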
      
      Test Plan: unit test.
      
      Reviewers: dhruba, yhchiang, ott, sdong
      
      Reviewed By: sdong
      
      Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
      
      Differential Revision: https://reviews.facebook.net/D56337
  31. 28 Apr 2016 (1 commit)
    • Shared dictionary compression using reference block · 843d2e31
      Committed by Andrew Kryczka
      Summary:
      This adds a new metablock containing a shared dictionary that is used
      to compress all data blocks in the SST file. The size of the shared dictionary
      is configurable in CompressionOptions and defaults to 0. It's currently only
      used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
      the compression type if the user chooses a nonzero dictionary size.
      
      During compaction, the dictionary is computed by randomly sampling the first
      output file in each subcompaction. It pre-computes the intervals to sample
      by assuming the output file will have the maximum allowable length. In case
      the file is smaller, some of the pre-computed sampling intervals can be beyond
      end-of-file, in which case we skip over those samples and the dictionary will
      be a bit smaller. After the dictionary is generated using the first file in a
      subcompaction, it is loaded into the compression library before writing each
      block in each subsequent file of that subcompaction.
      
      On the read path, the dictionary is read from the metablock, if it exists, and then
      loaded into the compression library before reading each block.
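      Enabling the feature is a matter of setting a nonzero dictionary size (the values are illustrative):

      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeDictionaryCompressionOptions() {
        rocksdb::Options options;
        options.compression = rocksdb::kLZ4Compression;
        options.compression_opts.max_dict_bytes = 16 * 1024;  // 16 KB shared dictionary per SST
        return options;
      }
      ```
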
      
      Test Plan: new unit test
      
      Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D52287
  32. 07 Apr 2016 (1 commit)
    • Embed column family name in SST file · 2391ef72
      Committed by Andrew Kryczka
      Summary:
      Added the column family name to the properties block. This property
      is omitted only if the property is unavailable, such as when RepairDB()
      writes SST files.
      
      In a next diff, I will change RepairDB to use this new property for
      deciding to which column family an existing SST file belongs. If this
      property is missing, it will add it to the "unknown" column family (same
      as its existing behavior).
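      A small sketch of reading the new property back (the helper name is illustrative):

      ```cpp
      #include <iostream>

      #include "rocksdb/db.h"
      #include "rocksdb/table_properties.h"

      void PrintColumnFamilyNames(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
        rocksdb::TablePropertiesCollection props;
        if (!db->GetPropertiesOfAllTables(cf, &props).ok()) return;
        for (const auto& entry : props) {
          std::cout << entry.first << " belongs to column family '"
                    << entry.second->column_family_name << "'\n";
        }
      }
      ```
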
      
      Test Plan:
      New unit test:
      
        $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty
      
      Reviewers: IslamAbdelRahman, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D55605
  33. 02 Apr 2016 (1 commit)
    • Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. · 9b519875
      Committed by Marton Trencseni
      Summary:
      When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
      What this feature adds: when an L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, i.e. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
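      A configuration sketch of enabling this (in current releases the prerequisite option is spelled `cache_index_and_filter_blocks`):

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      rocksdb::Options MakePinnedL0Options() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.cache_index_and_filter_blocks = true;            // prerequisite
        table_options.pin_l0_filter_and_index_blocks_in_cache = true;  // pin L0 index/filter blocks
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```
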
      
      Test Plan:
      'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
      I didn't run the Java tests, I don't have Java set up on my devserver.
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D56133
  34. 22 Mar 2016 (1 commit)
  35. 18 Mar 2016 (1 commit)
    • Adding pin_l0_filter_and_index_blocks_in_cache feature. · 522de4f5
      Committed by Marton Trencseni
      Summary:
      When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
      What this feature adds: when an L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, i.e. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
      When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.
      
      Test Plan:
      Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).
      
      DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
        Mac: OK.
        Linux: with D55287 patched in it's OK.
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D54801
  36. 10 Feb 2016 (1 commit)
  37. 12 Dec 2015 (1 commit)
    • Use SST files for Transaction conflict detection · 3bfd3d39
      Committed by agiardullo
      Summary:
      Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
      
      With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
      
      Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
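      A usage sketch of the conflict check this change strengthens (the key and value are illustrative): with snapshot validation, a conflicting write is now detected even after the original write has been flushed out of the memtable into an SST file.

      ```cpp
      #include "rocksdb/utilities/transaction_db.h"

      rocksdb::Status TryWriteWithConflictCheck(rocksdb::TransactionDB* txn_db) {
        rocksdb::WriteOptions write_options;
        rocksdb::TransactionOptions txn_options;
        txn_options.set_snapshot = true;  // validate writes against a snapshot
        rocksdb::Transaction* txn = txn_db->BeginTransaction(write_options, txn_options);
        // Fails (e.g. Status::Busy) if another writer modified "key" after the
        // snapshot; the check now consults SST files as well as memtables.
        rocksdb::Status s = txn->Put("key", "value_from_txn");
        if (s.ok()) s = txn->Commit();
        delete txn;
        return s;
      }
      ```
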
      
      Test Plan: unit tests, db bench
      
      Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb, yoshinorim
      
      Differential Revision: https://reviews.facebook.net/D50475
  38. 31 Oct 2015 (1 commit)
  39. 30 Oct 2015 (1 commit)
  40. 14 Oct 2015 (1 commit)
    • Separate InternalIterator from Iterator · 35ad531b
      Committed by sdong
      Summary:
      Separate out a new class, InternalIterator, from class Iterator, to be used when the look-up is done internally, which also means it operates on keys with sequence ID and type.
      
      This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
      At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
      
      Test Plan: Run all existing tests.
      
      Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
      
      Reviewed By: rven
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D48549