diff --git a/README.md b/README.md
index 682c2202ce43d6e61464621316cfc342df8431f8..97315d23e71ac3a0260a2ee9274db0a5f0b41eec 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e
 
 Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather. For **complex user-defined functions** with degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based message is performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide build-in scatter-optimized message aggregation functions.
 
-## Performance
+### Performance
 
 We test all the following GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on test dataset without early stoppping.
 
@@ -82,7 +82,7 @@ In most cases of large-scale graph learning, we need distributed graph storage a
 
 ## Highlight: Tons of Models
 
-The following are 13 graph learning models that have been implemented in the framework.
+The following are 13 graph learning models that have been implemented in the framework. See the details [here](https://pgl.readthedocs.io/en/latest/introduction.html#tons-of-models)
 
 |Model | feature |
 |---|---|
diff --git a/docs/source/md/introduction.md b/docs/source/md/introduction.md
index ec7a4bfe60604b5e7984843eb0c660e7ba391ede..e0474ef8cd814d2683e8af513afe72d79ff1fac2 100644
--- a/docs/source/md/introduction.md
+++ b/docs/source/md/introduction.md
@@ -41,7 +41,7 @@ Users only need to call the ``sequence_ops`` functions provided by Paddle to eas
 
 Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather. For **complex user-defined functions** with degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based message is performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide build-in scatter-optimized message aggregation functions.
 
-## Performance
+### Performance
 
 We test all the following GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on test dataset without early stoppping.
 
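The scatter-optimized, LodTensor-based message aggregation described in the paragraphs touched by this diff is exposed through PGL's send/recv interface. Below is a minimal sketch of a sum-aggregation step, assuming the PGL 1.x `GraphWrapper` API with Paddle's `fluid.layers.sequence_pool` as the `sequence_ops` reducer; the function and variable names are illustrative, not taken from this diff.

```python
import paddle.fluid as fluid


def sum_aggregation_layer(gw, feature):
    """Sketch of one message-passing step on a pgl.graph_wrapper.GraphWrapper."""

    def send_func(src_feat, dst_feat, edge_feat):
        # Copy each source node's feature onto its outgoing edges as the message.
        return src_feat["h"]

    def recv_func(msg):
        # Messages arrive grouped per destination node as a LodTensor, so a
        # single Paddle sequence op reduces them in parallel rather than
        # looping over degree buckets.
        return fluid.layers.sequence_pool(msg, pool_type="sum")

    msg = gw.send(send_func, nfeat_list=[("h", feature)])
    return gw.recv(msg, recv_func)
```

Replacing `recv_func` with a user-defined reduction follows the same parallel path over the LodTensor messages, which is the case where the speedup over degree bucketing is claimed above.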