diff --git a/README.md b/README.md
index 8fa34d76a857d50f0b19374e079d29836feafd9e..6777205de8134a98955eee7f80a882bb3887aa7a 100644
--- a/README.md
+++ b/README.md
@@ -13,23 +13,26 @@ flexible, catering to research needs.
 
 ## Introduction
 
-Design Principles:
+Features:
 
 - Production Ready:
-Key operations are implemented in C++ and CUDA, together with PaddlePaddle's
+
+  Key operations are implemented in C++ and CUDA, together with PaddlePaddle's
 highly efficient inference engine, enables easy deployment in server environments.
 
 - Highly Flexible:
-Components are designed to be modular. Model architectures, as well as data
+
+  Components are designed to be modular. Model architectures, as well as data
 preprocess pipelines, can be easily customized with simple configuration
 changes.
 
 - Performance Optimized:
-With the help of the underlying PaddlePaddle framework, faster training and
+
+  With the help of the underlying PaddlePaddle framework, faster training and
 reduced GPU memory footprint is achieved. Notably, Yolo V3 training is
 much faster compared to other frameworks. Another example is Mask-RCNN
-(ResNet50), we managed to fit up to 5 images per GPU (V100 16GB) during
-training.
+(ResNet50), where we managed to fit up to 4 images per GPU (Tesla V100 16GB) during
+multi-GPU training.
 
 Supported Architectures:
 
@@ -44,7 +47,7 @@ Supported Architectures:
 | Yolov3             | ✓      |                             ✗ | ✗       | ✗     | ✓         | ✓       |
 | SSD                | ✗      |                             ✗ | ✗       | ✗     | ✓         | ✗       |
 
-<a name="vd">[1]</a> ResNet-vd models offer much improved accuracy with negligible performance cost.
+<a name="vd">[1]</a> [ResNet-vd](https://arxiv.org/pdf/1812.01187) models offer much improved accuracy with negligible performance cost.
 
 Advanced Features:
 
@@ -67,7 +70,7 @@ Please follow the [installation guide](docs/INSTALL.md).
 ## Get Started
 
 For inference, simply run the following command and the visualized result will
-be saved in `output/`.
+be saved in `output`.
 
 ```bash
 export PYTHONPATH=`pwd`:$PYTHONPATH
@@ -102,6 +105,7 @@ Some of the planned features include:
 ## Updates
 
 #### Initial release (7/3/2019)
+
 - Initial release of PaddleDetection and detection model zoo
 - Models included: Faster R-CNN, Mask R-CNN, Faster R-CNN+FPN, Mask
   R-CNN+FPN, Cascade-Faster-RCNN+FPN, RetinaNet, Yolo v3, and SSD.
diff --git a/docs/GETTING_STARTED.md b/docs/GETTING_STARTED.md
index b610f5a683002ab167464a69b163924e6d120519..1612500b2b39e674d42c09f1aad5ee7df32b8395 100644
--- a/docs/GETTING_STARTED.md
+++ b/docs/GETTING_STARTED.md
@@ -75,8 +75,13 @@ path, simply add a `--save_file=` flag.
 
 ## FAQ
 
+**Q:**  Why do I get `NaN` loss values during single GPU training? <br>
+**A:**  The default learning rate is tuned for multi-GPU training (8 GPUs); it must
+be adapted for single GPU training accordingly (e.g., divide by 8).
 
-Q: Why do I get `NaN` loss values during single GPU training?
 
-A: The default learning rate is tuned to multi-GPU training (8x GPUs), it must
-be adapted for single GPU training accordingly (e.g., divide by 8).
+**Q:**  How can I reduce GPU memory usage? <br>
+**A:**  Setting the environment variable `FLAGS_conv_workspace_size_limit` to a smaller
+number can reduce the GPU memory footprint without affecting training speed.
+Taking Mask-RCNN (R50) as an example, with `export FLAGS_conv_workspace_size_limit=512`
+the batch size can reach 4 per GPU (Tesla V100 16GB).