Greenplum / Annotated Deep Learning Paper Implementations
Commit 81f6b55a
Authored Nov 10, 2020 by Varuna Jayasiri

📚 knn-lm index

Parent: 2f937806
Showing 3 changed files with 75 additions and 1 deletion (+75 -1)
labml_nn/transformers/knn/__init__.py  +5 -0
labml_nn/transformers/knn/build_index.py  +68 -1
labml_nn/transformers/knn/train_model.py  +2 -0
labml_nn/transformers/knn/__init__.py
View file @ 81f6b55a
...
...
@@ -21,4 +21,9 @@ So to run $k$NN-LM we need to:
* [Build an index](build_index.html) of $\big(f(c_i), w_i\big)$
* [Evaluate kNN-LM](eval_knn.html) using $k$NN search on $\big(f(c_i), w_i\big)$ with $f(c_t)$

This experiment uses a small dataset so that we can run this without using up a few hundred gigabytes
of disk space for the index.

The official implementation of $k$NN-LM can be found [here](https://github.com/urvashik/knnlm).
"""
labml_nn/transformers/knn/build_index.py
View file @ 81f6b55a

"""
# Index $\big(f(c_i), w_i\big)$

We store $f(c_i)$ and $w_i$ in memory mapped numpy arrays.
We find $f(c_i)$ nearest to $f(c_t)$ using [FAISS](https://github.com/facebookresearch/faiss).
FAISS indexes $\big(f(c_i), i\big)$ and we query it with $f(c_t)$.
"""
from typing import Optional

import faiss
...
...
@@ -10,71 +18,130 @@ from labml_nn.transformers.knn.train_model import Configs
def load_experiment(run_uuid: str, checkpoint: Optional[int] = None):
    """
    Load a saved experiment from [train model](train_model.html).
    """

    # Create configurations object
    conf = Configs()
    # Load custom configurations used in the experiment
    conf_dict = experiment.load_configs(run_uuid)
    # We need to get inputs to the feed forward layer, $f(c_i)$
    conf_dict['is_save_ff_input'] = True

    # This experiment is just an evaluation; i.e. nothing is tracked or saved
    experiment.evaluate()
    # Initialize configurations
    experiment.configs(conf, conf_dict, 'run')
    # Set models for saving/loading
    experiment.add_pytorch_models(get_modules(conf))
    # Specify the experiment to load from
    experiment.load(run_uuid, checkpoint)

    # Start the experiment; this is when it actually loads models
    experiment.start()

    return conf
def gather_keys(conf: Configs):
    """
    ## Gather $\big(f(c_i), w_i\big)$ and save them in numpy arrays

    *Note that these numpy arrays will take up a lot of space (even a few hundred gigabytes)
    depending on the size of your dataset*.
    """

    # Dimensions of $f(c_i)$
    d_model = conf.transformer.d_model
    # Training data loader
    data_loader = conf.trainer.data_loader
    # Number of contexts; i.e. number of tokens in the training data minus one.
    # $\big(f(c_i), w_i\big)$ for $i \in [2, T]$
    n_keys = data_loader.data.shape[0] * data_loader.data.shape[1] - 1
    # Numpy array for $f(c_i)$
    keys_store = np.memmap(str(lab.get_data_path() / 'keys.npy'), dtype=np.float32, mode='w+',
                           shape=(n_keys, d_model))
    # Numpy array for $w_i$
    vals_store = np.memmap(str(lab.get_data_path() / 'vals.npy'), dtype=np.int, mode='w+',
                           shape=(n_keys, 1))

    # Number of keys $f(c_i)$ collected
    added = 0
    with torch.no_grad():
        # Loop through the data
        for i, batch in monit.enum("Collect data", data_loader, is_children_silent=True):
            # $w_i$, the target labels
            vals = batch[1].view(-1, 1)
            # Input data moved to the device of the model
            data = batch[0].to(conf.device)
            # Run the model
            _ = conf.model(data)
            # Get $f(c_i)$
            keys = conf.model.ff_input.view(-1, d_model)
            keys = keys  # / torch.sqrt((keys ** 2).sum(-1, keepdims=True) + 1e-10)
            # Save keys, $f(c_i)$, in the memory mapped numpy array
            keys_store[added: added + keys.shape[0]] = keys.cpu()
            # Save values, $w_i$, in the memory mapped numpy array
            vals_store[added: added + keys.shape[0]] = vals
            # Increment the number of collected keys
            added += keys.shape[0]
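As a rough, hypothetical sizing check (illustrative numbers, not measurements from this repository): `keys.npy` holds $n_{keys} \times d_{model}$ float32 values, so with $d_{model} = 512$ a corpus of $10^8$ tokens would need about $10^8 \times 512 \times 4 \approx 200$ GB, which is the "few hundred gigabytes" warned about above; the Tiny Shakespeare data used in this experiment is only around $10^6$ characters, so its arrays stay in the low gigabytes, and `vals.npy` adds just one integer per token.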
def build_index(conf: Configs, n_centeroids: int = 2048, code_size: int = 64, n_probe: int = 8,
                n_train: int = 200_000):
    """
    ## Build FAISS index

    [Getting started](https://github.com/facebookresearch/faiss/wiki/Getting-started),
    [faster search](https://github.com/facebookresearch/faiss/wiki/Faster-search),
    and [lower memory footprint](https://github.com/facebookresearch/faiss/wiki/Lower-memory-footprint)
    tutorials on FAISS will help you learn more about FAISS usage.
    """

    # Dimensions of $f(c_i)$
    d_model = conf.transformer.d_model
    # Training data loader
    data_loader = conf.trainer.data_loader
    # Number of contexts; i.e. number of tokens in the training data minus one.
    # $\big(f(c_i), w_i\big)$ for $i \in [2, T]$
    n_keys = data_loader.data.shape[0] * data_loader.data.shape[1] - 1

    # Build an index with Voronoi cell based faster search and compression that
    # doesn't store full vectors.
    quantizer = faiss.IndexFlatL2(d_model)
    index = faiss.IndexIVFPQ(quantizer, d_model, n_centeroids, code_size, 8)
    index.nprobe = n_probe

    # Load the memory mapped numpy array of keys
    keys_store = np.memmap(str(lab.get_data_path() / 'keys.npy'), dtype=np.float32, mode='r',
                           shape=(n_keys, d_model))

    # Pick a random sample of keys to train the index with
    random_sample = np.random.choice(np.arange(n_keys), size=[min(n_train, n_keys)], replace=False)

    with monit.section('Train index'):
        # Train the index to store the keys
        index.train(keys_store[random_sample])

    # Add the keys to the index in chunks of 1024; $\big(f(c_i), i\big)$
    for s in monit.iterate('Index', range(0, n_keys, 1024)):
        e = min(s + 1024, n_keys)
        # $f(c_i)$
        keys = keys_store[s:e]
        # $i$
        idx = np.arange(s, e)
        # Add to the index
        index.add_with_ids(keys, idx)

    with monit.section('Save'):
        # Save the index
        faiss.write_index(index, str(lab.get_data_path() / 'faiss.index'))
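A note on the memory trade-off behind this choice (hypothetical numbers continuing the sizing example above): with `code_size = 64` and 8 bits per sub-quantizer, `IndexIVFPQ` stores each key as 64 bytes of product-quantization codes plus an id, instead of $d_{model} \times 4$ bytes of raw float32, so $10^8$ keys compress from roughly 200 GB to under 10 GB. At search time `n_probe` controls how many of the `n_centeroids` Voronoi cells are scanned, trading recall for speed.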
def main():
    # Load the experiment
    conf = load_experiment('4984b85c20bf11eb877a69c1a03717cd')
    # Set the model to evaluation mode
    conf.model.eval()

    # Collect $\big(f(c_i), w_i\big)$
    gather_keys(conf)
    # Add them to the index for fast search
    build_index(conf)
...
...
labml_nn/transformers/knn/train_model.py
View file @ 81f6b55a
...
...
@@ -251,6 +251,8 @@ def character():
def tiny_shakespeare(c: Configs):
    """
    Initialize/load the Tiny Shakespeare dataset

    This dataset is from Andrej Karpathy's [char-rnn](https://github.com/karpathy/char-rnn) project.
    """
    return TextFileDataset(
        lab.get_data_path() / 'tiny_shakespeare.txt',
        c.tokenizer,
...
...