Commit 9d6333a4 authored by wnma3mz

mv files and complete 20 issues

Parent e6974bcb
# Stop Using word2vec
Original article: [Stop Using word2vec](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
When I started playing with word2vec four years ago I needed (and luckily had) tons of supercomputer time. But because of advances in our understanding of word2vec, computing word vectors now takes fifteen minutes on a single run-of-the-mill computer with standard numerical libraries[1](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#1). Word vectors are [awesome](http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/) but you don’t need a neural network – and definitely don’t need deep learning – to find them[2](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#2). So if you’re using word vectors and aren’t gunning for state of the art or a paper publication then *stop using word2vec.*
When we’re finished you’ll measure word similarities:
```
facebook ~ twitter, google, ...
```
… and the classic word vector operations: `zuckerberg - facebook + microsoft ~ nadella`
…but you’ll do it mostly by counting words and dividing, no gradients harmed in the making!
## The recipe
Let’s assume you’ve already gotten your hands on the [Hacker News corpus](https://cloud.google.com/bigquery/public-data/hacker-news) and cleaned & tokenized it (or downloaded the preprocessed version [here](https://zenodo.org/record/49899)). Here’s the method (a short end-to-end sketch in code follows the recipe):
1. **Unigram Probability**. How often do I see `word1` and `word2` independently? ![Algo visualizing unigram counts](https://multithreaded.stitchfix.com/assets/posts/2017-10-18-stop-using-word2vec/fig_001.gif)*Example*: This is just a simple word count that fills in the `unigram_counts` array. Then divide the `unigram_counts` array by its sum to get the probability `p` and get numbers like: `p('facebook')` is 0.001% and `p('lambda')` is 0.000001%
2. **Skipgram Probability**. How often did I see `word1` nearby `word2`? These are called ‘skipgrams’ because we can ‘skip’ up to a few words between `word1` and `word2`.[3](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#3) ![Algo visualizing skipgram counts](https://multithreaded.stitchfix.com/assets/posts/2017-10-18-stop-using-word2vec/fig_002.gif)*Example*: Here we’re counting pairs of words that are near each other, but not necessarily right next to each other. After normalizing the `skipgram_count` array, you’ll get a measurement of word-near-word probabilities like `p('facebook', 'twitter')`. If the value for `p('facebook', 'twitter')` is 10^-9 then out of one billion skipgram tuples you’ll typically see ‘facebook’ and ‘twitter’ once. For another skipgram like `p('morning', 'facebook')` this fraction might be much larger, like 10^-5, simply because the word `morning` is a frequent word.
3. **Normalized Skipgram Probability (or PMI)**. Was the skipgram frequency higher or lower than what we expected from the unigram frequency? Some words are extremely common, some are very rare, so divide the skipgram frequency by the two unigram frequencies. If the result is more than 1.0, then that skipgram occurred more frequently than the unigram probabilities of the two input words and we’ll call the two input words “associated”. The greater the ratio, the more associated; the smaller the ratio (below 1.0), the more “anti-associated”. If we take the log of this number it’s called the [pointwise mutual information (PMI)](https://en.wikipedia.org/wiki/Pointwise_mutual_information) of word X and word Y. This is a well-understood measurement in the information theory community and represents how frequently X and Y coincide ‘mutually’ (or jointly) rather than independently. ![Algo visualizing PMI](https://multithreaded.stitchfix.com/assets/posts/2017-10-18-stop-using-word2vec/fig_004.gif)*Example 1*: If we look at the association between `facebook` and `twitter` we’ll see that it’s above 1.0: `p('facebook', 'twitter') / p('facebook') / p('twitter') = 1000`. So `facebook` and `twitter` have unusually high co-occurrence and we infer that they must be associated or similar. Note that we haven’t done any neural network stuff and no math aside from counting & dividing, but we can already measure how associated two words are. Later on we’ll calculate word vectors from this data, and those vectors will be constrained to reproduce these word-to-word relationships. [4](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#4)
*Example 2*: For `facebook` and `okra` the PMI (`p('facebook', 'okra') / p('facebook') / p('okra')`) is close to 1.0, so `facebook` and `okra` aren’t associated empirically in the data. Later on we’ll form vectors that reconstruct and respect this relationship, and so will have little overlap. But of course the word counts are noisy, and this noise induces spurious associations between words.
4. **PMI Matrix**. Make a big matrix where each row represents word `X` and each column represents word `Y` and each value is the `PMI` we calculated in step 3: `PMI(X, Y) = log (p(x, y) / p(x) / p(y))`. Because we have as many rows and columns as we have words, the size of the matrix is (`n_vocabulary`, `n_vocabulary`). And because `n_vocabulary` is typically 10k-100k, this is a big matrix with lots of zeros, so it’s best to use sparse array data structures to represent the PMI matrix. ![PMI Matrix](https://multithreaded.stitchfix.com/assets/posts/2017-10-18-stop-using-word2vec/fig_005.001.jpeg)
5. **SVD**. Now we reduce the dimensionality of that matrix. This effectively compresses our giant matrix into two smaller matrices. Each of these smaller matrices forms a set of word vectors of size (`n_vocabulary`, `n_dim`).[5](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#5) Each row of one matrix represents a single word vector. This is a straightforward operation in any linear algebra library, and in Python it looks like: `U, S, V = scipy.sparse.linalg.svds(PMI, k=256)` ![SVD of PMI Matrix](https://multithreaded.stitchfix.com/assets/posts/2017-10-18-stop-using-word2vec/fig_005.003.jpeg)*Example*. The SVD is one of the most fundamental and beautiful tools you can use in machine learning and is what’s doing most of the magic. Read more about it in [Jeremy Kun’s](https://twitter.com/jeremyjkun?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) excellent [series](https://jeremykun.com/2016/04/18/singular-value-decomposition-part-1-perspectives-on-linear-algebra/). Here, we can think of it as compressing the original input matrix, which (counting zeros) had close to ~10 billion entries (100k x 100k), into two matrices with a total of ~50 million entries (100k x 256 x 2), a 200x reduction in space. The SVD outputs a space that is orthogonal, which is where we get our “linear regularity” and is where we get the property of being able to add and subtract word vectors meaningfully.
6. **Searching**. Each row of the SVD output matrices is a word vector. Once you have these word vectors, you can search for the tokens closest to `consolas`, which is a font popular in programming.
```
# The original pseudocode, made concrete (assumes `U` is the (n_vocabulary, n_dim)
# matrix from the SVD, `tokens` lists the vocabulary in row order, and `index`
# maps each token to its row -- see the sketch below):
import numpy as np

# Get the row vector corresponding to the word 'consolas'
vector_consolas = U[index['consolas']]
# Get how similar it is to all other words
similarities = U @ vector_consolas
# Sort by similarity, and pick the most similar (in practice, skip 'consolas' itself)
most_similar = tokens[int(np.argmax(similarities))]
most_similar
```
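To tie steps 1–5 together before looking at what that search returns, here is a minimal end-to-end sketch. The toy `docs` corpus, the `window` size, and `n_dim` are illustrative stand-ins, not values the recipe prescribes; on the real Hacker News data you would feed in the tokenized posts and something like `k=256`.
```
# Minimal sketch of steps 1-5: count, divide, take a log, run one SVD.
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

docs = [["facebook", "bought", "whatsapp"],
        ["twitter", "and", "facebook", "earnings"]]   # toy stand-in corpus
window = 5    # words this far apart (or closer) count as a skipgram pair
n_dim = 2     # tiny only because the toy vocabulary is tiny; use ~256 for real data

# Step 1: unigram counts -> unigram probabilities p(x)
unigram_counts = Counter(w for doc in docs for w in doc)
tokens = sorted(unigram_counts)
index = {w: i for i, w in enumerate(tokens)}
p_unigram = np.array([unigram_counts[w] for w in tokens], dtype=float)
p_unigram /= p_unigram.sum()

# Step 2: skipgram counts -> joint probabilities p(x, y)
skipgram_counts = Counter()
for doc in docs:
    for i, w1 in enumerate(doc):
        for w2 in doc[i + 1 : i + 1 + window]:
            skipgram_counts[w1, w2] += 1
            skipgram_counts[w2, w1] += 1   # keep the matrix symmetric
total_skipgrams = sum(skipgram_counts.values())

# Steps 3-4: PMI(x, y) = log(p(x, y) / p(x) / p(y)), stored as a sparse matrix
rows, cols, vals = [], [], []
for (w1, w2), count in skipgram_counts.items():
    i, j = index[w1], index[w2]
    rows.append(i)
    cols.append(j)
    vals.append(np.log((count / total_skipgrams) / (p_unigram[i] * p_unigram[j])))
PMI = csr_matrix((vals, (rows, cols)), shape=(len(tokens), len(tokens)))

# Step 5: truncated SVD; each row of U is a word vector
U, S, V = svds(PMI, k=n_dim)
```
The numbers on a six-word toy vocabulary are meaningless; the point is that the whole recipe is counting, dividing, one `log`, and one `svds` call.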
So searching for `consolas` yields `verdana` and `inconsolata` as most similar – which makes sense as these are other fonts. Searching for `Functional Programming` yields `FP`(an acronym), `Haskell` (a popular functional language), and `OOP` (which stands for Object Oriented Programming, an alternative to functional programming). Further, adding and subtracting these vectors and then searching gets word2vec’s hallmark feature: in computing the analogy `Mark Zuckerberg - Facebook + Amazon` we can relate the CEO of Facebook to that of Amazon, which appropriately evaluates to the `Jeff Bezos` token.
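The analogy is just more arithmetic on those rows. Here is a hedged sketch that reuses `U`, `tokens`, and `index` from the sketch above; row-normalizing first (so dot products become cosine similarities) is a common convenience, not something the post requires.
```
import numpy as np

# Assumes U, tokens, index from the end-to-end sketch above.
W = U / np.linalg.norm(U, axis=1, keepdims=True)   # cosine-friendly rows

def analogy(a, minus, plus, topn=5):
    """Tokens closest to vector(a) - vector(minus) + vector(plus)."""
    query = W[index[a]] - W[index[minus]] + W[index[plus]]
    sims = W @ query
    ranked = np.argsort(-sims)
    return [tokens[i] for i in ranked if tokens[i] not in {a, minus, plus}][:topn]

# On the full corpus, analogy('zuckerberg', 'facebook', 'amazon') should rank
# 'bezos' near the top, mirroring the Mark Zuckerberg - Facebook + Amazon example.
```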
It’s a hell of a lot more intuitive & easier to count skipgrams, divide by the word counts to get how ‘associated’ two words are and SVD the result than it is to understand what even a simple neural network is doing.
So stop using the neural network formulation, but still have fun making word vectors!
### Footnotes
1. The approach outlined here isn't exactly equivalent, but it performs about the same as word2vec skipgram negative-sampling [SGNS](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization). It turns out that word2vec isn't [totally identical to matrix factorization](http://building-babylon.net/2016/05/12/skipgram-isnt-matrix-factorisation/), but for applied purposes (e.g., industry) the SVD technique is good enough. If you care about the differences between Word2vec, [Glove](https://nlp.stanford.edu/projects/glove/) and [Swivel](https://arxiv.org/abs/1602.02215) -- and in industrial purposes I rarely do -- then you'll care about word2vec SGNS vs this SVD formulation. Also, the SVD formulation neatly fits into a family of count-based bag-of-words techniques like TF-IDF, LSI and LDA: ![img](https://multithreaded.stitchfix.com/assets/posts/2017-10-18-stop-using-word2vec/fig_006.png)TF-IDF compares term frequencies to term frequencies within a document, with LSI doing a low-rank factorization of that TF-IDF matrix using the SVD. LDA also counts term frequencies within documents, but instead of SVD factorizes that result using a sparsity-inducing prior in the term and document vectors. That prior, among other things, makes them interpretable. Similarly, the formulation outlined here counts term-to-term associations (no documents) and SVD factorizes those into a low-rank representation. [↩](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#back-1)
2. A few caveats: if you're doing academic research and spinning off your own embedding systems (for example as in lda2vec [[code\]](https://github.com/cemoody/lda2vec), [[paper\]](https://arxiv.org/abs/1605.02019) [[blog\]](http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/#topic=38&lambda=1&term=)), tweaking the neural network approach can be useful. Also, SVD scales as O(N^3) so it isn't the best where you have large vocabularies N >> 100k. In this case, SGD is nice for online problems. [↩](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#back-2)
3. In the simplest case, we won't model the effect of distance-weighted skipgrams. We'll just consider skipgrams within a fixed-size moving window around each word. Also note that we aren't regularizing our model, which could benefit from smoothing or forming a prior. Penalizing complexity especially helps in low-data cases, but in my experience this isn't necessary to get decent results on a wide set of real-life corpora with these methods. [↩](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#back-3)
4. In some formulations, especially in truncated PMI, an important ingredient is the threshold `k` (typically a value like 25.0) which dampens the effect of low-PMI words, ostensibly to get a handle on noise. I interpret the `k` constant as a form of regularization, although it isn't clear to me that this is a disciplined prior. You can see [gauss2vec](https://arxiv.org/abs/1412.6623) for a more careful derivation, and this [paper](https://arxiv.org/abs/1402.3722) for an empirical exploration of this parameter. A concrete sketch of this clipping appears after the footnotes. [↩](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#back-4)
5. The interpretation is that these word vectors explain the covariance within the PMI matrix -- essentially a low-rank way of saying words X and Y are associated because they co-occur beyond their base rate popularity. Because the axes in the reduced space are orthogonal, it also explains why one can find 'linear regularities' that made the original word2vec famous. For example 'King' and 'Queen' might be separated along a 'gender' direction but 'London' and 'Berlin' might share a spot in a 'capital' direction. These eigenvectors effectively find a small set of directions in the input space that can still maximally recreate the original large input matrix. [↩](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com#back-5)
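As a concrete reading of the threshold `k` in footnote 4: one common formulation (the shifted positive PMI from the Levy & Goldberg SGNS paper linked in footnote 1) subtracts `log(k)` and clips anything negative to zero before the SVD. A small sketch under that assumption:
```
import numpy as np

def shifted_positive_pmi(pmi, k=25.0):
    """Dampen low-PMI (noisy) pairs: SPPMI = max(PMI - log k, 0)."""
    return np.maximum(pmi - np.log(k), 0.0)
```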