diff --git "a/20171015 \347\254\25413\346\234\237/Intro To Data Analysis For Everyone! Part 1.md" "b/20171015 \347\254\25413\346\234\237/Intro To Data Analysis For Everyone! Part 1.md" index 8c5880d8616a5546ecd1841e31081aab55fc28dc..51ff9b0a12d28586cb5ecacf6cc8ac4f6f8a88b5 100644 --- "a/20171015 \347\254\25413\346\234\237/Intro To Data Analysis For Everyone! Part 1.md" +++ "b/20171015 \347\254\25413\346\234\237/Intro To Data Analysis For Everyone! Part 1.md" @@ -1,168 +1,166 @@ -# Intro To Data Analysis For Everyone! Part 1 +# 每个人的数据分析! Part 1 原文链接:[Intro To Data Analysis For Everyone! Part 1](https://towardsdatascience.com/intro-to-data-analysis-for-everyone-part-1-ff252c3a38b5?from=hackcv&hmsr=hackcv.com) -Data analysis is part of any data scientists daily work ([along with data munging and cleansing](https://www.thoughtworks.com/insights/blog/let-data-scientists-be-data-mungers)). It is also very important for a good portion of everyone else in the modern workforce. This could be a systems analysts, business owners, financial teams, and project managers. +数据分析是任何数据科学家日常工作的一部分([以及数据篡改和清理](https://www.thoughtworks.com/insights/blog/let-data-scientists-be-data-mungers))。这对现代劳动力中其他大部分人来说也是非常重要的。可以是系统分析师、业务所有者、财务团队和项目经理。 -However, most undergrad courses do not ([or at the very least did not](https://www.coursera.org/browse/data-science/data-analysis?languages=en)) teach the basics of data analysis in any of their courses. There were math courses, and statistics, as well as plenty of computer programming courses that involved data structures and algorithms. +然而,大多数本科课程并没有[或至少没有](https://www.coursera.org/browse/datscience/datanalysis?)数据分析,而是有数学和统计学的课程,还有大量涉及数据结构和算法的计算机编程课程。 -Yet, none of these focused on how to look at data sets from databases, csvs or the dozens of other data sources that exist in the modern data world. +然而,这些都没有关注如何查看来自数据库、csvs或现代数据世界中存在的数十个其他数据源的数据集。 -There might be the occasional project that requires analyzing data. Some individuals might have been lucky enough to receive a set of projects that forced them to analyze data for the first time out of a database. However, most students are left to attempt to figure it out themselves during their first job! +可能偶尔会有需要分析数据的项目。有些人可能很幸运地收到了一组项目,迫使他们第一次从数据库中分析数据。然而,大多数学生都在他们的第一份工作中试图自己解决这个问题。 -For students not planning to be programmers, u[nderstanding databases and SQL is a super-valuable skill](http://www.skilledup.com/articles/learn-sql-it-most-in-demand-skill-in-single-day). It allows them access to data that was once held hostage by database teams. +对于不打算成为程序员的学生来说, [理解数据库和SQL是一项非常有价值的技能](http://www.skilledup.com/articles/learn-sql-it-most-in-demand-skill-in-single-day),这样可以让他们理解那些已经被数据库团队分析后的数据。 -Managers are no longer ok with their teams not having access to data! Thus, even a marketing major needs to know how to work with and devise analysis from data! +管理人员不再能接受他们的团队看不懂数据,或不知道如何进行数据分析!因此,即使是营销专业的学生也需要知道如何使用和设计数据分析! -Data analysis is abstract. It is not math(although math is involved), it is not english or accounting. It requires a hands on approach in order to truly understand the pitfalls good analysts will run into. Yet, most students have not had to deal with vague parameters, and large data sets by the time they get into their first job, which is shame! Many students haven’t even heard of a data warehouse, and this is where most of the data that helps managers make critical decisions reside. 
+数据分析是抽象的。它不是数学(虽然涉及数学),也不是英语或会计。要真正理解优秀分析师会遇到的陷阱,就需要有实际的方法。然而,很遗憾的是大多数学生在进入第一份工作时,还不需要处理模糊的参数和庞大的数据集,许多学生甚至没有听说过数据仓库,然后这正是帮助管理者做出关键决策的大部分数据所在之处。 -In the modern business world, data analysis is not limited to data scientists. It is also key for analysts, systems engineers, financial teams, PR, HR, marketing, and so on. +在现代商业世界中,数据分析并不局限于数据科学家。对于分析师、系统工程师、金融团队、公关、人力资源、营销等等来说,这也是很重要的技能。 -Thus, our team wanted to give a guide to helping both new students and those interested in learning more about data science and analysis. +因此,我们的团队想提供一个指南,帮助新学生和那些有兴趣学习更多数据科学和分析的人。 -### The Foundation Of Good Data Science And Analytics +### 良好数据科学和分析的基础 -This first part in this series will cover the important soft skills required for good analysis. [Data analysis is not only math, SQL and scripting](https://www.theseattledataguy.com/statistics-data-scientist-review/). It is also about staying organized and being able to clearly articulate to managers the discoveries that have been unearthed. This is one of many traits that [successful teams in data science and analytics portray](https://www.theseattledataguy.com/top-30-tips-data-science-team-succeeds/). We believe it is important to point these out first because it lays the groundwork for our next few parts. +本系列的第一部分将介绍良好分析所需的重要软技能。 [数据分析不仅仅是数学、SQL和脚本](https://www.theseattledataguy.com/statistics-data-scientist-review/)。它还包括保持组织有序,能够清晰地向管理者阐明已经发现的发现。这是[成功的数据科学和分析团队所描绘](https://www.theseattledataguy.com/top-30-tips-data-science-team-succeeds/)的众多特征之一。我们认为首先指出这些是很重要的,因为它为我们接下来的几个部分奠定了基础。 -After this section, we will discuss analysis processes, techniques and give examples with data sets, SQL and python notebooks. +在本节之后,我们将讨论分析过程、技术,并通过数据集、SQL和python笔记给出示例。 -**Communication** +**沟通** -[The term data storyteller has become correlated with data scientist](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen), but it is also important for anyone who uses data to be good at communicating their findings! +[术语数据讲故事者已经与数据科学家联系在一起](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen),但对于使用数据的人来说,擅长传达他们的发现也很重要! -This skill-subset fits in the general skill of communication. Data scientists have access to multiple data sources from various departments. This gives them the responsibility and need to be able to clearly explain what they are discovering to executives and SMEs in multiple fields. They take complex mathematical and technological concepts and create clear and concise messages that executives can act upon. Not just hiding behind their jargon, but actually transcribing their complex ideas into business speak. Analysts and data scientists alike must be able to take numbers and return clearly stated ROIs and actionable decisions. +这种技能子集符合通信的一般技能。数据科学家可以访问来自不同部门的多个数据源。这使他们有责任并且需要能够清楚地解释他们在多个领域向高管和中小企业发现的内容。他们采用复杂的数学和技术概念,创建清晰简洁的信息,管理人员可以采取行动。不只是躲在他们的行话背后,而是将他们复杂的想法转化为商业话语。分析师和数据科学家都必须能够获取数字并返回明确规定的投资回报率和可行的决策。 -This means not only taking good notes and creating solid work books. It also means creating solid reports and walk throughs for other teams. +这意味着不仅要记笔记,还要创造扎实的工作簿。它还意味着为其他团队创建可靠的报告和遍历。 -How do you do that?(this could be a post in itself), but here are some quick tips to better communicate your ideas in a report or presentation. +这是如何做到的?(这可能是一个帖子本身),但这里有一些快速提示,可以更好地在报告或演示文稿中传达你的想法。 -1. Label every figure, axis, data point, etc -2. Create a natural flow of data and notes in a note book -3. Make sure to highlight your key findings! Don’t bury the lead, sell your big conclusion! 
This is easier said than done when you have lots of data to prove your point. -4. Imagine you are actually telling a story or writing an essay with data -5. Don’t bore your audience to death, keep it sweet and succinct -6. Avoid heavy math jargon! If you can’t explain your calculations in plain English, you don’t understand them -7. Peer review your reports and presentations to ensure for maximum clarity + 1. 标记每个图形,轴,数据点等 + 2. 在笔记本中创建自然的数据和笔记流 + 3. 确保突出您的主要发现!不要藏起来,把你的结论展示出来!使用大量数据证明您的观点时,这说起来容易做起来难。 + 4. 想象一下,你实际上在讲故事或写一篇有关数据的文章 + 5. 不要让观众觉得枯燥,保持甜美和简洁 + 6. 避免繁重的数学术语!如果你不能用简单的英语解释你的计算,你就没有完全理解。 + 7. 让同行审核您的报告和演示文稿,以确保最大程度的清晰度 -**One Of Our Favorite Examples Of Data Story Telling!** +**我们最喜欢的数据故事之一!** -**Empathetic Listening** +**善于倾听** -Data scientists and analysts aren’t always on the same team as the business owners, and managers that come to them with questions. This makes it very important for analysts to listen diligently to what is actually being asked of them. +数据科学家和分析师并不总是与企业主和管理人员在同一个团队中提出问题。这使得分析师非常重视聆听实际被问到的内容。 -Working in large corporations, there is a lot of value in trying to seek out other teams pain points and problems and help them through it! This means having empathy. Part of this skill requires experience in the workforce and other parts of this skill simply require understanding other human beings. +在大公司工作,试图寻找其他团队的痛点和问题并帮助他们度过难关是很有价值的!这意味着要有同理心。这项技能的一部分需要劳动力的经验,而这项技能的其他部分只需要了解其他人。 -Why are they really asking for the analysis and how can you make it as clear and accurate for them as possible? +为什么他们真的要求进行分析?你如何使分析尽可能清晰准确? -Miscommunication with the business owners can happen quiet easily. Thus, the combination of [listening diligently as well as listening for what is not being said is a great asset.](https://www.forbes.com/sites/glennllopis/2013/05/20/6-effective-ways-listening-can-make-you-a-better-leader/#3fafb2421756) +与企业主沟通不畅很容易发生。因此 [认真倾听和倾听言外之意是一项很棒的技能](https://www.forbes.com/sites/glennllopis/2013/05/20/6-effective-ways-listening-can-make-you-a-better-leader/#3fafb2421756)。 ![img](https://cdn-images-1.medium.com/max/1000/0*x4gXpuM1k7rgyHi9.) -**Context Focused** +**关注背景** -Besides being focused on details. Data analysts and data scientists also need to focus on what context is behind the data they are analyzing. This means understanding the needs of the other departments who have requested the project as well as actually understanding the processes behind the data they are analyzing. +除了关注细节。数据分析师和数据科学家还需要关注他们分析的数据背后的背景。这意味着理解请求项目的其他部门的需求,以及实际理解他们分析的数据背后的过程。 -Data typically represents the processes of a business. This could be a user interacting with a ecommerce site, a patient in a hospital, a project getting approved, software being purchased and invoiced and so on. +数据通常表示业务的流程。这可能是一个用户与电子商务网站交互,一个病人在医院,一个项目获得批准,软件被购买和开发等等。 -All of these get represented in thousands of data warehouses and databases across the world and all of them are often stored just slightly differently with different business rules. +这意味着,数据分析师需要理解这些业务规则和逻辑!否则,他们就无法进行良好的分析,他们会做出错误的假设,并且常常会创建脏的、重复的数据。 -That means, data analysts need to understand those business rules and logic! Otherwise, they can’t perform good analysis, they will make bad assumptions and they will often create dirty and duplicate data. +这都是因为他们不理解使用的场景。上下文允许以数据为中心的团队更清楚地做出假设。他们不需要在假设阶段花太多时间去检验所有可能的理论。相反,他们可以利用上下文来帮助加速分析的过程。 -All because they did not understand context. Context allows data focused teams to make assumptions more clearly. 
They are not forced to spend too much time in the hypothesis phase where they are testing every possible theory. Instead, they can utilize context to help speed up the process of their analysis. +数据周围的元数据(例如上下文)对于数据科学家来说就像黄金。它并不总是在那里,但当它在的时候。它使我们的工作更容易! -The metadata (e.g. context) around data, is like gold to a data scientists. It isn’t always there, but when it is. It makes our jobs much easier! +**记录能力** -**Note Taking Prowess** +[无论是用excel还是Jupyter笔记本](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html).。对于数据分析师来说,了解如何跟踪他们的工作是很重要的! -[Whether using excel or Jupyter notebook](http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). It is important for a data analyst to understand how to track their work! +分析需要大量的假设和问题,如果没有记录下来,就会失去思路。 -Analysis requires a lot of assumptions and questions, and single track thinking that can be lost if not noted down. +第二天回来时,很容易忘记分析了什么,不同的查询和指标是如何以及为什么被提取的,等等。因此,以一种勤奋的方式记录下每一件事情是很重要的。这个技巧是不能留给第二天的,因为总是会有信息丢失! -It is easy to come back the next day and forget what was analyzed, how and why different queries and metrics were pulled, etc. Thus, it is important to note everything down in a diligent manner. This skill is not to be left to the next day, because there will always be loss of information! +创建一个清晰的记录方式使每个人都更容易参与。我们在之前的交流中提到过。然而,一次。 -Creating a clear note taking style makes it easier for everyone involved. We brought this up earlier in communication. However, again. +标签,创造自然流通的笔记,避免商业术语可以帮助每个人参与,包括当初的记录人!当记录者都不理解自己的笔记,这将是相当尴尬的一件事。 -Labeling, creating a natural flow of notes, and avoiding business jargon can help everyone involved. Even the original note taker! It is pretty embarrassing when even the original note taker does not understand their notes! +记笔记很重要! -Note taking saves lives! +**创造性和抽象思维** -**Creative and Abstract Thinking** +创造力和 [抽象思维 ](http://www.projectlearnet.org/tutorials/concrete_vs_abstract_thinking.html)有助于数据科学家更好地假设他们在最初探索阶段看到的可能模式和特征。将逻辑思维与最小的数据点结合起来,数据科学家可以得出几种可能的解决方案。然而,这需要跳出框框进行思考。 -Creativity and [abstract thinking ](http://www.projectlearnet.org/tutorials/concrete_vs_abstract_thinking.html)helps data scientists better hypothesize possible patterns and features they are seeing in their initial exploration phases. Combining logical thinking with minimal data points, data scientists can lead themselves to several possible solutions. However, this requires thinking outside of the box. +分析是有纪律的研究和创造性思维的结合。如果分析师受到确认偏差或过程的限制,那么他们可能无法得出正确的结论。 -Analysis is a combination of disciplined research and creative thinking. If an analysts is too limited by confirmation bias or process, then they might not reach the correct conclusions. - -If, on the other hand, they are too wildly thinking, and not using basic deduction and induction to drive their search. They could spend weeks trying to answer a simple question as they wander through various data sets without any real clear cut goal. +另一方面,如果他们过于疯狂地思考,没有使用基本的推论和归纳来驱动他们的搜索。在浏览各种数据集时,他们可能会花上数周时间试图回答一个简单的问题,而没有任何明确的目标。 **Engineering Mindset** -Analysts need to be able to take large problems and data sets and break them down into smaller pieces. Sometimes, the 2–3 questions asked by a separate team can’t be answered by 2–3 answers. +分析师需要能够将大问题和数据集分解成更小的部分。有时候,一个单独的团队提出的2-3个问题无法用2-3个答案来回答。 -Instead, the 2–3 questions themselves might need to be broken down into small bite size questions that can be analyzed and supported by data. 
+相反,2-3个问题本身可能需要被分解成小问题,这些问题可以被数据分析和支持。 -Only then, can the analyst go back and answer the larger questions. Especially with large and complex data sets. It is becoming more and more important to be[ able to clearly breakdown analysis into its proper pieces](http://www.thwink.org/sustain/articles/000_AnalyticalApproach/index.htm). +只有这样,分析师才能回去回答更大的问题。特别是对于大而复杂的数据集。[能够清楚地将分析分解成适当的部分](http://www.thwink.org/sustain/articles/000_AnalyticalApproach/index.htm).变得越来越重要。 -**Attention To Details** +**注意细节** -Analysis requires attention to details. Just because an analyst or data scientist might be a big picture person. This does not mean they are not responsible for figuring out all the valuable details that surround a project. +分析需要注意细节。仅仅因为一个分析师或数据科学家可能是一个大局观的人。这并不意味着他们不负责找出围绕项目的所有有价值的细节。 -Companies, even small ones have lots of nooks and crannies. There are processes on processes and not understanding those processes and their details affects the level of analysis that can be done. +公司,甚至是小公司都有很多角落和缝隙。流程上有流程,但不理解这些流程及其细节会影响可执行的分析级别。 -Especially when writing complex queries and programming scripts. It is very easy to incorrectly join a table or filter the wrong thing. Thus, it is key to always double and triple check work(also, if scripts are involved, peer reviews should be too!). +特别是在编写复杂的查询和编程脚本时。很容易不正确地连接表或过滤错误的东西。因此,总是进行两次和三次检查工作是非常关键的(而且,如果涉及脚本,同行评审也应该如此!) ![img](https://cdn-images-1.medium.com/max/1000/0*R-Rff55mXIQkP7mJ.) -**Curiosity** +**好奇心** -Analysis requires curiosity. We will get into this when we break down the process. However, a step in the analysis process is listing out all the questions you believe are valuable to the analysis. This requires a curious mind that cares to know the answer. +分析需要的好奇心。当我们分解这个过程时,我们会讲到这个。然而,分析过程中的一个步骤是列出所有您认为对分析有价值的问题。这需要一个好奇的心去关心答案。 -Why is the data the way it is, why are we seeing patterns, what can we use to find the answer, and Who would know? +为什么数据是这样,为什么我们看到模式,我们能用什么来找到答案,谁知道呢? -These are just some vague questions that can help start pointing analysis in the right direction. [There needs to be that drive and desire to know why!](http://www.ibmbigdatahub.com/podcast/curious-data-scientist) +这些只是一些模糊的问题,可以帮助我们开始向正确的方向进行分析。 [需要有那种动力和欲望去知道为什么!](http://www.ibmbigdatahub.com/podcast/curious-data-scientist) -**Tolerance of Failure** +**宽容失败** -Data science has a lot of similarities to the science field. In the sense that there might be 99 failed hypotheses that lead to 1 successful solution. Some data driven companies only expect their machine learning engineers and data scientists to create new algorithms, or correlations every year to year and a half. This depends on the size of the task and the type of implementation required (e.g. process implementation, technical, policy, etc). In all of this work there is failure after failure, there is unanswered question after unanswered question and analysts have to continue. +数据科学与科学领域有许多相似之处。从这个意义上说,可能有99个失败的假设导致1个成功的解决方案。一些数据驱动型的公司只希望他们的机器学习工程师和数据科学家每年创造新的算法,或者每年半的相关性。这取决于任务的大小和所需的实现类型(例如流程实现、技术、策略等)。在所有这些工作中都有失败后的失败,有未回答的问题后的问题和分析师不得不继续。 -The point is to get the answer, or clearly state why you can’t answer the question. However, it can’t just be giving up because the first few attempts failed. +关键是要得到答案,或者清楚地说明为什么你不能回答这个问题。然而,它不能仅仅因为最初的几次尝试失败而放弃。 -Analysis can be a black hole for time. Question after question can be incorrect. That is why it is important to have a semi-structured process. One that guides analysts but doesn’t keep them back. 
+分析可以成为时间的黑洞。一个接一个的问题可能是不正确的。这就是为什么半结构化过程很重要。它可以指导分析师,但不会阻止他们。 -### **Data Science and Analytics Soft Skills** +### **数据科学和分析软技能** -These skills analysts and data scientists need aren’t all about programming and statistical analysis. Instead, these skills are about focusing on making sure the the insights that are discovered are easily transferable. This allows other team members and managers to also gain from the analysis done! +这些技能分析人员和数据科学家需要的不仅仅是编程和统计分析。相反,这些技巧的重点在于确保所发现的洞见是易于转移的。这允许其他团队成员和经理也从分析中获益! -Analysts need to be able to do more than just come to a conclusion. They need to be able to create work that is easily reproducible and communicable. +分析师需要做的不仅仅是得出结论。他们需要能够创造出易于复制和传播的工作。 -**Why?** +**为什么?** -It not only saves time! +它不仅节省时间! -It more importantly helps leadership trust the analyst’s conclusion. Otherwise, the analysts might be correct, but if he or she sounds unconfident, if they have bad notes, or are even missing one data point. It can instantly lead to distrust among leadership! +更重要的是,这有助于领导信任分析师的结论。否则,分析师可能是对的,但如果他或她听起来不自信,如果他们记错了笔记,甚至漏掉了一个数据点。这会立即导致领导层之间的不信任! -Sadly this is very true! Analysts work can instantly come into question when even just one data point is incorrect or communicated poorly. We often recommended that data teams do a walk through of their reports and presentations just to check for holes. Having a team member that is good at questioning every angle is great in these situations! +不幸的是,这是真的!当仅仅一个数据点不正确或沟通不畅时,分析师的工作就会立即受到质疑。我们经常建议数据团队检查他们的报告和演示文稿,检查漏洞。在这种情况下,有一个善于质疑每个角度的团队成员是很好的! -The more your team can pre-answer questions executives may have. The more likely the executives will sign off on the next leg of the project! +你的团队可以提前回答高管可能提出的问题越多。高管们更有可能在项目的下一阶段签字! ![img](https://cdn-images-1.medium.com/max/1000/0*J7W2YgdjexxKsr4X.) -**The Process of Data Analysis** +**数据分析的过程** -In the next portion we will lay out a process for analyzing data. We will be setting up basic notebooks and describing simple processes that will help new and experienced data scientists and analysts make sure they are tracking their work effectively. +在下一部分中,我们将介绍分析数据的过程。我们将建立基本的笔记和描述简单的过程,这将帮助新的和有经验的数据科学家和分析师确保他们有效地跟踪他们的工作。 -### [Part 2 Of Data Analysis For Everyone](https://medium.com/@SeattleDataGuy/data-analysis-for-everyone-part-2-cf1c79441940) +### [Part 2 每个人的数据分析](https://medium.com/@SeattleDataGuy/data-analysis-for-everyone-part-2-cf1c79441940) -**Other Resources About Data Science And Strategy** +**其他关于数据科学和策略的资源** [How To Apply Data Science To Real World Problems](https://www.theseattledataguy.com/data-science-case-studies/) diff --git "a/20171015 \347\254\25413\346\234\237/PyTorch tutorial distilled.md" "b/20171015 \347\254\25413\346\234\237/PyTorch tutorial distilled.md" index d314a96b52945d86542be2ff19642a6f0bf6d798..13df31c62b53047b486c664ded3c9e64564fda06 100644 --- "a/20171015 \347\254\25413\346\234\237/PyTorch tutorial distilled.md" +++ "b/20171015 \347\254\25413\346\234\237/PyTorch tutorial distilled.md" @@ -1,46 +1,44 @@ -# PyTorch tutorial distilled +# PyTorch教程 原文链接:[PyTorch tutorial distilled](https://towardsdatascience.com/pytorch-tutorial-distilled-95ce8781a89c?from=hackcv&hmsr=hackcv.com) -## Migrating from TensorFlow to PyTorch +## 从 TensorFlow 转向了 PyTorch ![img](https://cdn-images-1.medium.com/max/2000/1*aqNgmfyBIStLrf9k7d9cng.jpeg) -When I first started study PyTorch, I drop it after a few days. It was hard for me to get core concepts of this framework comparing with the TensorFlow. 
That’s why I’ve put it on my “knowledge bookshelf” and forgot about it. But not so far ago a new version of PyTorch was released. So I’ve decided to give it a chance again. After a while, I understood that this framework is really easy to use and it makes me happy to code in PyTorch. In this post, I will try to explain core concepts of it clearly so that you will be motivated at least give it a try right now, not after a few years or more. We will cover some basic principles and some advanced stuff as learning rate schedulers, custom layers and more. +在我第一次开始学习PyTorch时候,过了几天我就放弃了,对我来说理解这个框架的核心概念和TensorFlow比起来太难了。这就是为什么我把它放在了我的“知识书架”上,渐渐的遗忘了它。但是不久之后,PyTorch的新版本的发布了,我决定再尝试一次。过了一会,我意识到这个框架简便易行,让我很开心的使用PyTorch来编程。我会尝试清楚地解释它的核心概念,这样你就会有动力,至少现在试一试,而不是几年或更长时间。我们将介绍一些基本原则和一些高级内容,如学习速率调度程序,自定义层等。 -#### Resources +#### 学习资料 -First that you should know about PyTorch it that [documentation](http://pytorch.org/docs/master/) and [tutorials](http://pytorch.org/tutorials/)are stored separately. Also sometimes they may don’t meet each other, because of fast development and version changes. So fill free to investigate [source code](http://pytorch.org/tutorials/). It’s very clear and straightforward. And it’s better to mention that there are exist awesome [PyTorch forums](https://discuss.pytorch.org/), where you may ask any appropriate question, and you will get an answer relatively fast. This place seems to be even more popular than StackOverflow for the PyTorch users. +首先,你应该了解PyTorch, [文档](http://pytorch.org/docs/master/) 和 [教程](http://pytorch.org/tutorials/)是分开存储的。因为更新的太快了,所有他们可能有部分会不一样,所以请查阅 [源代码](http://pytorch.org/tutorials/),这就非常明确和直截了当。而且,还有一个很棒的[PyTorch论坛](https://discuss.pytorch.org/),在那里你可以提出任何合适的问题,你可以得到一个相对较快的答案。 对于PyTorch用户来说,这个地方似乎比StackOverflow更受欢迎。 #### PyTorch as NumPy -So let’s dive into PyTorch itself. The main building block of the PyTorch is the tensors. Really, they are very similar to the [NumPy ones](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html). Tensors support a lot of the same API, so sometimes you may use PyTorch just as a drop-in replacement of the NumPy. You may ask what the reason is. The principal goal is that PyTorch can utilize GPU so that you can transfer your data preprocessing or any other computation hungry stuff to machine learning workhorse. And it’s very easy to convert tensors from NumPy to PyTorch and vice versa. Let’s check some examples in code: +让我们来讨论PyTorch本身,PyTorch的主要构建块是tensors。确实他和[NumPy ones](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)很相似。 Tensors支持很多和相同的API,因此有时可以使用PyTorch作为NumPy的代替品。你可能想问为什么要这么做,主要的原因是PyTorch的主要目标是使用GPU,这样您就可以将数据预处理或任何其他需要大量计算的内容转移到机器学习中。很容易就可以转换tensors从NumPy转换为PyTorch,反之亦然。我们用代码来举个例子: + - -#### From the tensors to the variables - -Tensors are an awesome part of the PyTorch. But mainly all we want is to build some neural networks. What is about backpropagation? Of course, we can manually implement it, but what is the reason? Thankfully automatic differentiation exists. To support it PyTorch [provides variables](http://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_autograd.html) to you. Variables are wrappers above tensors. With them, we can build our computational graph, and compute gradients automatically later on. Every variable instance has two attributes: `.data` that contain initial tensor itself and `.grad` that will contain gradients for the corresponding tensor. +张量是PyTorch一个很棒的部分. 
但我们真正想要的是构建神经网络。那反向传播怎么办?当然,我们可以手动实现它,但有什么必要呢?值得庆幸的是,自动微分是存在的。为了支持它,PyTorch为您[提供了变量(Variable)](http://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_autograd.html)。变量是张量的包装。有了它们,我们就可以构建计算图,并在之后自动计算梯度。每个变量实例都有两个属性:`.data`包含初始张量本身,而`.grad`将包含相应张量的梯度。

-You may note that we have manually computed and applied our gradients. It’s so tedious. Do we have some optimizer? Of course!

+您可能会注意到,我们是手动计算并应用梯度的,这太繁琐了。我们有优化器吗?答案是肯定的!

-Now all our variables will be updated automatically. But the main point that you should get from the last snippet: we still should manually zero gradients before calculating new ones. This is one of the core concepts of the PyTorch. Sometimes it may be not very obvious why we should do this, but on the other hand, we have full control over our gradients, when and how we want to apply them.

+现在我们所有的变量都会自动更新。但你应该从上一个代码片段中记住的要点是:在计算新的梯度之前,我们仍然需要手动把梯度清零。这是PyTorch的核心概念之一。有时为什么要这样做可能不太明显,但另一方面,我们可以完全控制梯度,决定何时以及如何应用它们。

-#### Static vs. dynamic computational graphs

+#### 静态与动态计算图的比较

-Next main difference between PyTorch and TensorFlow is their approach to the graph representation. Tensorflow [uses a static graph](https://www.tensorflow.org/programmers_guide/graphs), that means that we define it once and after execute that graph over and over again. In PyTorch each forward pass defines a new computational graph. In the beginning, the distinction between those approaches not so huge. But dynamic graphs became very handful when you want to debug your code or define some conditional statements. You can use your favorite debugger as it is! Compare next two definitions of the while loop statements - the first one in TensorFlow and the second one in PyTorch:

+PyTorch和TensorFlow的另一个主要区别在于它们的计算图表示方式。Tensorflow[使用静态图](https://www.tensorflow.org/programmers_guide/graphs),这意味着我们先定义一次计算图,之后再一遍又一遍地执行它。而在PyTorch中,每次前向传递都会定义一个新的计算图。一开始,这两种方法的差别并不大。但当您想要调试代码或定义一些条件语句时,动态图就会变得非常方便,您可以直接使用自己喜欢的调试器!比较下面while循环语句的两种定义 - 第一个是TensorFlow的,第二个是PyTorch的:

@@ -54,143 +52,137 @@ It seems to me that second solution much easier than first one. And what do you

 #### Models definition

-Ok, now we see that it’s easy to build some if/else/while complex statements in PyTorch. But let’s revert to the usual models. The framework provides out of the box layers constructors very similar to [Keras](https://keras.io/) ones:

+好的,现在我们看到,在PyTorch中构建一些if/else/while复杂语句很容易。但还是让我们回到常规的模型上来。该框架提供了与[Keras](https://keras.io/)非常相似的开箱即用的层构造器:

-> The `nn` package defines a set of **Modules**, which are roughly equivalent to neural network layers. A Module receives input Variables and computes output Variables, but may also hold internal state such as Variables containing learnable parameters. The `nn` package also defines a set of useful loss functions that are commonly used when training neural networks.

+> `nn`包定义了一组**模块(Module)**,大致相当于神经网络的层。模块接收输入变量并计算输出变量,但也可以保存内部状态,例如包含可学习参数的变量。`nn`包还定义了一组训练神经网络时常用的损失函数。

-Also if we want to build more complex models, we may subclass provided `nn.Module` class. And of course, these two approaches can be mixed with each other.

+另外,如果我们想构建更复杂的模型,可以继承框架提供的`nn.Module`类。当然,这两种方法也可以相互混合。

-At the `__init__` method we should define all layers that will be used later. In the `forward` method, we should propose steps how we want to use already defined layers. Backward pass, as usual, will be computed automatically. 
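+下面给出一个最小的示意(网络结构、维度与变量名均为假设),分别用`nn.Sequential`和继承`nn.Module`的方式定义同一个小网络,并附上一个训练步骤,顺便演示上文提到的手动梯度清零:

+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.autograd import Variable
+
+# 方式一:类似 Keras 的开箱即用构造器(结构为假设)
+model = nn.Sequential(
+    nn.Linear(64, 32),
+    nn.ReLU(),
+    nn.Linear(32, 10),
+)
+
+# 方式二:继承 nn.Module,在 __init__ 中声明层,在 forward 中组织计算
+class Net(nn.Module):
+    def __init__(self):
+        super(Net, self).__init__()
+        self.fc1 = nn.Linear(64, 32)
+        self.fc2 = nn.Linear(32, 10)
+
+    def forward(self, x):
+        return self.fc2(F.relu(self.fc1(x)))
+
+net = Net()
+optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
+
+x = Variable(torch.randn(8, 64))   # 假设的一批输入
+y = Variable(torch.randn(8, 10))   # 假设的目标
+loss = F.mse_loss(net(x), y)
+
+optimizer.zero_grad()              # 先手动把梯度清零
+loss.backward()                    # 反向传播计算梯度
+optimizer.step()                   # 应用梯度更新参数
+```

+在较新的PyTorch版本中,`Variable`已与张量合并,直接使用张量即可。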
+在`__init__`方法中,我们应该定义稍后将使用的所有层。在`forward`方法中,我们应该提出我们想要使用已经定义的层的步骤。像往常一样,向后传递将自动计算。 -#### Self-defined layers +#### 自定义图层 -But what if we want to define some custom model with nonstandard backprop? Here is one example — XNOR networks: +但是如果我们想用非标准的backprop定义一些自定义模型呢?这是一个例子 - XNOR网络: ![img](https://cdn-images-1.medium.com/max/1000/1*cjzIFgglAP9xGKg8mlRysQ.png) -I will not dive into details, more about this type of networks you may read in the [initial paper](https://arxiv.org/abs/1603.05279). All relevant to our issue is that backpropagation should be applied only to weights that less than 1 and greater than -1. In PyTorch it [can be implemented quite easy](http://pytorch.org/docs/master/notes/extending.html): +我不会深入了解详细信息,更多关于您可能在[入门手册](https://arxiv.org/abs/1603.05279)中阅读的此类网络。与我们的问题相关的是,反向传播应仅适用于小于1且大于-1的权重。在PyTorch中,它[可以非常简单地实现](http://pytorch.org/docs/master/notes/extending.html): -So as you may see, we should only define exactly two methods: one for forward and one for backward pass. If we need access to some variables from the forward pass we may store them in the `ctx` variable. Note: in previous API forward/backward methods were not static and we stored required variables as `self.save_for_backward(input)` and access them as `input, _ = self.saved_tensors`. +你可能会看到,我们应该只定义两个方法:一个用于前进,一个用于后向传递。如果我们需要从前向传递中访问一些变量,我们可以将它们存储在`ctx`变量中。注意:在以前的API中,前向/后向方法不是静态的,我们将所需的变量存储为`self.save_for_backward(input)`并通过`input,_ = self.saved_tensors`访问。 -#### Train model with CUDA +#### 用CUDA训练模型 -If was discussed earlier how we might pass one tensor to the CUDA. But if we want to pass the whole model, it’s ok to call `.cuda()` method from the model itself, and wrap each input variable to the `.cuda()` and it will be enough. After all computations, we should get results back with `.cpu()` method. +如果之前讨论过如何将一个张量传递给CUDA。但是如果我们想要传递整个模型,可以从模型本身调用`.cuda()`方法,并将每个输入变量包装到`.cuda()`中就足够了。在所有计算之后,我们应该使用`.cpu()`方法返回结果。 -Also, PyTorch supports direct devices allocation at the source code: +此外,PyTorch支持源代码中的直接设备分配: -Because sometimes we want to run the same model on the CPU and the GPU without code modification I propose some kind of wrapper: +因为有时我们想在没有代码修改的情况下在CPU和GPU上运行相同的模型,我建议使用某种包装器: -#### Weight initialization +#### 权重初始化 -In TensorFlow weights initialization mainly are made during tensor declaration. PyTorch offers another approach — at first, tensor should be declared, and on the next step weights for this tensor should be changed. Weights can be initialized as direct access to the tensor attribute, as a call to the bunch of methods inside `torch.nn.init` package. This decision can be not very straightforward, but it becomes useful when you want to initialize all layers of some type with same initialization. +在TensorFlow中,权重初始化主要在张量声明期间进行。 PyTorch提供了另一种方法 - 首先应该声明张量,并且在下一步中应该改变该张量的权重。权重可以初始化为对tensor属性的直接访问,作为对`torch.nn.init`包中的一堆方法的调用。这个决定可能不是很简单,但是当你想用相同的初始化初始化某些类型的所有层时它会变得很有用。 -#### Excluding subgraphs from backward +#### 逆向排除子图 -Sometimes when you want to retrain some layers of your model or prepare it for the production mode, it’s great when you can disable autograd mechanics for some layers. For this purposes, [PyTorch provides two flags](http://pytorch.org/docs/master/notes/autograd.html): `requires_grad`and `volatile`. First one will disable gradients for current layer, but child nodes still can calculate some. The second one will disable autograd for current layer and for all child nodes. 
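+一个典型的场景是只微调模型的最后一层、冻结其余参数,最小草图如下(所用预训练模型与层名均为假设):

+```python
+import torch.nn as nn
+import torch.optim as optim
+from torchvision import models
+
+model = models.resnet18(pretrained=True)   # 假设使用的预训练模型
+for param in model.parameters():
+    param.requires_grad = False            # 冻结已有参数,反向传播不再为它们计算梯度
+
+model.fc = nn.Linear(512, 10)              # 换上新的分类头,默认 requires_grad=True
+
+# 只把仍需要梯度的参数交给优化器
+optimizer = optim.SGD(
+    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
+```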
+有时,当您想要重新训练模型的某些层,或者为生产模式做准备时,能够为某些层禁用自动求导机制会非常有用。为此,[PyTorch提供了两个标志](http://pytorch.org/docs/master/notes/autograd.html):`requires_grad`和`volatile`。第一个会禁用当前层的梯度计算,但其子节点仍然可以计算;第二个则会禁用当前层及其所有子节点的自动求导。

-#### Training process

-There are also exists some other bells and whistles in PyTorch. For example, you may use [learning rate scheduler](http://pytorch.org/docs/master/optim.html#how-to-adjust-learning-rate) that will adjust your learning rate based on some rules. Or you may enable/disable batch norm layers and dropouts with single train flag. If you want it’s easy to change random seed separately for CPU and GPU.

-

+#### 训练过程

+PyTorch中还有一些其他的实用功能。例如,您可以使用[学习率调度器](http://pytorch.org/docs/master/optim.html#how-to-adjust-learning-rate)根据一定的规则调整学习率;也可以用一个train标志来统一启用或禁用批归一化层和dropout;如果需要,为CPU和GPU分别设置随机种子也很容易。

-Also, you may print info about your model, or save/load it with few lines of code. If your model was initialized with [OrderedDict](https://docs.python.org/3/library/collections.html) or class-based model string representation will contain names of the layers.

+此外,您可以打印模型的信息,或者用几行代码保存/加载它。如果您的模型是用[OrderedDict](https://docs.python.org/3/library/collections.html)或基于类的方式初始化的,模型的字符串表示中会包含各层的名称。

-

-As per PyTorch documentation saving model with `state_dict()` method is [more preferable](http://pytorch.org/docs/master/notes/serialization.html).

+根据PyTorch文档,保存模型时使用`state_dict()`方法[更为可取](http://pytorch.org/docs/master/notes/serialization.html)。

-#### Logging

+#### 记录

-Logging of the training process is a pretty important part. Unfortunately, PyTorch has no any tools like tensorboard. So you may use usual text logs with [Python logging module](https://docs.python.org/3/library/logging.html) or try some of the third party libraries:

+记录训练过程是非常重要的一环。不幸的是,PyTorch目前还没有类似tensorboard的工具。因此,您可以使用[Python日志模块](https://docs.python.org/3/library/logging.html)输出常规的文本日志,或者尝试一些第三方库:

-- [A simple logger for experiments](https://github.com/oval-group/logger)
-- [A language-agnostic interface to TensorBoard](https://github.com/torrvision/crayon)
-- [Log TensorBoard events without touching TensorFlow](https://github.com/TeamHG-Memex/tensorboard_logger)
-- [tensorboard for pytorch](https://github.com/lanpa/tensorboard-pytorch)
-- [Facebook visualization library wisdom](https://github.com/facebookresearch/visdom)

+- [一个用于实验的简单记录器](https://github.com/oval-group/logger)
+- [TensorBoard的语言无关接口](https://github.com/torrvision/crayon)
+- [在不接触TensorFlow的情况下记录TensorBoard事件](https://github.com/TeamHG-Memex/tensorboard_logger)
+- [为pytorch提供的tensorboard支持](https://github.com/lanpa/tensorboard-pytorch)
+- [Facebook可视化库visdom](https://github.com/facebookresearch/visdom)

-#### Data handling

+#### 数据处理

-You may remember [data loaders proposed in TensorFlow](https://www.tensorflow.org/api_guides/python/reading_data) or even tried to implement some of them. For me, it took about 4 hours or more to get some idea how all pipeline should work.

+您可能还记得[TensorFlow中提出的数据加载器](https://www.tensorflow.org/api_guides/python/reading_data),甚至尝试过实现其中的一些。对我来说,我花了大约4个小时甚至更多时间,才弄明白整个流水线应该如何工作。

 ![img](https://cdn-images-1.medium.com/max/1000/1*S00VU2HiEjNZ35zlj2kqfw.gif)

-Image source: TensorFlow docs

+图片来源:TensorFlow docs

-Initially, I thought to add here some code, but I think such gif will be enough to explain basic idea how all things happen.

+最初,我想在这里添加一些代码,但我认为这样一张gif已经足以解释整件事情的基本思路。

-PyTorch developers decided do not reinvent the wheel. 
They just use multiprocessing. To create your own custom data loader, it’s enough to inherit your class from `torch.utils.data.Dataset` and change some methods: 
+PyTorch的开发者决定不重新发明轮子,他们直接使用了多进程。要创建自己的自定义数据加载器,只需继承`torch.utils.data.Dataset`类并改写其中的几个方法即可:

-The two things you should know. First — image dimensions are different from TensorFlow. They are [batch_size x channels x height x width]. But this transformation can be made without you interaction by preprocessing step `torchvision.transforms.ToTensor()`. There are also a lot of useful utils in the [transforms package](http://pytorch.org/docs/master/torchvision/transforms.html).

+有两件事你应该知道。首先,图像的维度顺序与TensorFlow不同,它们是[batch_size x channels x height x width]。不过,通过预处理步骤`torchvision.transforms.ToTensor()`,这个转换可以自动完成,无需您手动干预。[transforms包](http://pytorch.org/docs/master/torchvision/transforms.html)中还有很多有用的工具。

-The second important thing that you may use pinned memory on GPU. For this, you just need to place additional flag `async=True` to a `cuda()` call and get pinned batches from DataLoader with flag `pin_memory=True`. More about this feature [discussed here](http://pytorch.org/docs/master/notes/cuda.html#use-pinned-memory-buffers).

-#### Final architecture overview

-Now you know about models, optimizers and a lot of other stuff. What is the right way to merge all of them? I propose to split your models and all wrappers on such building blocks:

+第二件重要的事情是,您可以在GPU上使用固定内存(pinned memory)。为此,只需要在`cuda()`调用中额外传入`async=True`标志,并在DataLoader中设置`pin_memory=True`来获取固定内存中的批次。[更多相关讨论见这里](http://pytorch.org/docs/master/notes/cuda.html#use-pinned-memory-buffers)。

+#### 最终的体系结构概述

+现在你已经了解了模型、优化器和很多其他的东西。把它们组合在一起的正确方法是什么?我建议把你的模型和所有的包装器拆分成下面这样的构建模块:

 ![img](https://cdn-images-1.medium.com/max/1000/1*A-cWYNur2lqDEhUF1_gdCw.png)

-And here is some pseudo code for clarity:

-

+这里有一些帮助说明的伪代码:

-#### Conclusion

+#### 总结

-I hope with this post you’ve understood main points of PyTorch:

+我希望通过这篇文章,你能理解PyTorch的要点:

-- It can be used as drop-in replacement of Numpy
-- It’s really fast for prototyping
-- It’s easy to debug and use conditional flows
-- There are lots of great tools out of the box

+- 它可以作为NumPy的直接替代品
+- 它做原型开发非常快
+- 调试和使用条件控制流很容易
+- 有很多现成的好工具

-PyTorch is the fast-growing framework with an awesome community. And I think that today is the best day to try it out! 
\ No newline at end of file
+PyTorch是一个快速发展的框架,拥有很棒的社区。我认为今天就是尝试它的最好时机!
\ No newline at end of file diff --git "a/20171015 \347\254\25413\346\234\237/The Search for Better Search at Reddit.md" "b/20171015 \347\254\25413\346\234\237/The Search for Better Search at Reddit.md" index 1a996d994d74e7cfabd4c0950ed1da23a12488fd..461fbe9536ed3c37c8dc783c6dc6bf58af8d51c3 100644 --- "a/20171015 \347\254\25413\346\234\237/The Search for Better Search at Reddit.md" +++ "b/20171015 \347\254\25413\346\234\237/The Search for Better Search at Reddit.md" @@ -1,173 +1,132 @@ -# The Search for Better Search at Reddit +# 在Reddit上搜索更好的搜索 原文链接:[The Search for Better Search at Reddit](https://redditblog.com/2017/09/07/the-search-for-better-search-at-reddit/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com) -Because, certainly, we’ve solved it this time + [TECHNOLOGY](https://redditblog.com/topic/technology/) [Staff](https://redditblog.com/author/blabyrinth/) • [September 7, 2017](https://redditblog.com/2017/09/07/the-search-for-better-search-at-reddit/) **Chris Slowe, Nick Caldwell, & Luis Bitencourt-Emilio***CTO, VP of Engineering, Director of Engineering* -## **What’s the Fuss?** - -A common question we get from newbie engineering team members here at Reddit is “When are we going to fix search?” Until this year, the answer was always “Go ask the search team on the 5th floor.” Which was great fun because a) the elevator button to the 5th floor didn’t work and b) there was no search team. - -But the times, they are a-changin’. We’re happy to announce that we’re launching a new search engine at Reddit. Actually, it’s been launched to 50% of traffic for the past couple weeks and has already served up nearly half a billion queries. Now that we’re confident in our system, we’re pushing it to 100% of traffic. We hope you enjoy faster and more reliable results! - -More importantly, we’ve also started an entire product unit dedicated to search and relevance here at Reddit, led by our Director of Engineering Luis. We recognize that these technologies are critical to Reddit’s future. Our platform contains one of the world’s most interesting collections of content, currently indexing over a quarter billion posts for search, and it gets bigger every day. But we know this content is hard to find. Improving search and relevance will allow Reddit to sift through millions of posts, comments, and communities to create a custom-fit stream of great content straight to your home feed. - -That’s the future. For now, we thought it’d be fun to take a trip down memory lane. - - - -## **A Brief History of Reddit Search** - -Needless to say, search is not an easy challenge to solve. We’ve been on a bit of a roller coaster when it comes to search at Reddit, but now that we’re on our sixth search stack, we’re no strangers to the struggles of doing search at scale. Below is a rough outline of the 12-year history, along with a few select quotes from the team as we’ve iterated to scale our infra to Reddit’s needs: - -- 2005 – Steve Huffman ([u/spez](https://www.reddit.com/user/spez)), co-founder and now CEO, turns on postgres 7.4’s contrib/[tsearch2](http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/). - - This was a simpler time, when the statement “Oh, we can just have Postgres do it!” was greeted with “Sounds good to me! What - - can’t - - Postgres do!?” We also really liked +## **什么是Fuss?** - TRIGGER +我们从Reddit的新手工程团队成员那里得到的一个常见问题是“我们什么时候才能修复搜索?”直到今年,答案总是“去询问5楼的搜索团队。”这很有趣,因为 - s back then (“No, it’s cool. 
The database does all the work and it’s guaranteed to be accurate” is something we no doubt said). It worked well, but it wasn’t very tunable, and we quickly discovered we were bogging down the majority of Postgres queries with a small minority (~2%) of search traffic: +1. 到5楼的电梯按钮没有工作 +2. 没有搜索团队 - - “We fixed a bug in the search results ordering.” —[Steve](https://redditblog.com/2006/02/27/if-you-want-something-done-right-do-it-yourself/) - - “We updated the search system this morning to help alleviate some load problems.” —[Steve](https://redditblog.com/2006/07/25/searching/) - - “Jeremy is working on search! It’s not a complicated fix (basically, the sorting is whacky).” —[Steve](https://redditblog.com/2007/04/28/updates/) +但是时代在进步,这是一个改革。我们很高兴地宣布,我们正在Reddit推出一个新的搜索引擎。实际上,在过去的几周里它已经启动了50%的流量,并且已经提供了近5亿次查询。现在我们对我们的系统充满信心,我们将其推向100%的流量。我们希望您享受更快,更可靠的结果! -- 2007 – Chris Slowe ([u/KeyserSosa](https://www.reddit.com/user/KeyserSosa)), founding engineer (and now CTO), re-implements with PyLucene. +更重要的是,我们还在Reddit开设了一个专门用于搜索和相关的整个产品部门,由我们的工程总监Luis领导。我们认识到这些技术对Reddit的未来至关重要。我们的平台包含世界上最有趣的内容集合之一,目前索引超过25亿个搜索帖子,并且它每天都在变大。但我们知道这个内容很难找到。改进搜索和相关性将使Reddit能够筛选数百万个帖子,评论和社区,以便直接为您的家庭Feed提供定制的精彩内容流。 - - - This was actually implemented just over 10 years ago in July 2007. It consisted of a single Python process which was set up as a threaded RPC server over TCP. In the initial version, we had actually supported searching for both post titles and comments, and the Lucene index files were comfortably stored on a single box. This was also before we - - moved to AWS - - , and at the time we had seriously considered getting a - - Google Search Appliance - - , which would have made a nice addition to our - - single +那就是未来。就目前而言,我们认为沿着记忆之路旅行会很有趣。 - rack. This version was flexible, but we didn’t set it up in a way to make it easily scalable: - - - “Search works much better, tagging and user-controlled subreddits are right around the corner” —[Steve](https://redditblog.com/2007/07/26/new-reddit-on-the-horizon/) - - “Search is better, but not quite where we’d like it.” —[Steve](https://redditblog.com/2007/08/21/its-slow-its-unstable-its-beta/) - - “Stats and search are temporarily disabled, but will be coming back as soon as we can get them repaired.” —[Steve](https://redditblog.com/2007/10/16/reddit-status-update/) - - “We were hoping to include an upgraded search, which, unlike the last version, was actually useful and helped you find what you were looking for. Unfortunately, the version we settled on didn’t quite load test as nicely” —[Steve](https://redditblog.com/2007/10/18/reddit-status-update-part-ii/) - - “I made a quick fix to search that I hope helps until we get a chance to really fix it.” —[Steve](https://redditblog.com/2007/06/08/a-note-on-search-and-what-were-working-on/) + -- 2008 – David King ([u/ketralnis](https://www.reddit.com/user/ketralnis)), third employee and now search engineer, implements Solr. +## **Reddit搜索简洁的历史** - In fact, he implemented a home-built pysolr, which was capable of shipping update documents to Solr in XML and wrapping the response in such a way as to emulate our existing +不用说,搜索不是一个容易解决的挑战。在Reddit上搜索的时候,我们就像坐过山车一样,但现在我们已经是第六次搜索了,我们对大规模搜索的困难并不陌生。下面是关于12年历史的粗略概述,以及一些来自我们团队的精选引语,我们通过迭代来将我们的infra扩展到Reddit的需求: - Query +- 2005 – Steve Huffman ([u/spez](https://www.reddit.com/user/spez)), 创始人之一,现任首席执行官, 开启了 postgres 7.4’s contrib/[tsearch2](http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/). - models enough to drop it into any sort or listing. It was actually pretty sweet. 
The initial version didn’t support comments, but that did come later. + 当有人说 “哦,我们可以用 Postgres来完成!” , “对我来说听起来不错!?” “我们当时也非常喜欢TRIGGERs(”不,这很酷。数据库完成所有工作,并且保证准确无误“是我们毫无疑问的说法)。它工作得很好,但它不是很可调,我们很快发现我们正在以少数(约2%)搜索流量阻塞大多数Postgres查询: - - “[David]’s been fixing search and hacking mystery projects in Erlang.” —[Alexis Ohanian](https://redditblog.com/2008/04/17/welcome-david/) - - “I’ve totally replaced the reddit search function.” —[David King](https://redditblog.com/2008/04/21/new-search-2/) + - “我们修复了搜索结果排序中的错误。” —[Steve](https://redditblog.com/2006/02/27/if-you-want-something-done-right-do-it-yourself/) + - “我们今天早上更新了搜索系统,以帮助缓解一些负载问题。” —[Steve](https://redditblog.com/2006/07/25/searching/) + - “Jeremy正致力于搜索!这不是一个复杂的修复(排序很糟糕)。” —[Steve](https://redditblog.com/2007/04/28/updates/) -- 2010 – David replaces Solr with IndexTank, a third-party search provider. +- 2007 – Chris Slowe ([u/KeyserSosa](https://www.reddit.com/user/KeyserSosa)), 创始工程师(现在是CTO),与PyLucene一起重新实施。 - When you love something, outsource it… said no one ever. As the site continued to grow and we first cracked a billion - - pageviews + 这实际上是在10年前的2007年7月实现的。它由一个Python进程组成,该进程被设置为TCP上的线程RPC服务器。在初始版本中,我们实际上支持搜索帖子标题和评论,并且Lucene索引文件可以舒适地存储在一个盒子上。这也是在我们搬到AWS之前,当时我们已经认真考虑过使用Google Search Appliance,这对我们的单机架来说是一个很好的补充。这个版本很灵活,但我们没有以一种易于扩展的方式进行设置: - in a month with an engineering team of four, we put all of our effort into 503 mitigation, continuing to add Postgres read slaves, adding more cache, starting to take advantage of a + - “搜索的效果变得更好这标记着用户可以更好的进行控制。” —[Steve](https://redditblog.com/2007/07/26/new-reddit-on-the-horizon/) + - “搜索效果更好,但是不是我们喜欢的地方。” —[Steve](https://redditblog.com/2007/08/21/its-slow-its-unstable-its-beta/) + - “统计数据和搜索暂时被禁用,但只要我们能够修复它们就会回来。” —[Steve](https://redditblog.com/2007/10/16/reddit-status-update/) + - “我们希望包含升级后的搜索,与上一版本不同,它实际上非常有用,可以帮助您找到所需内容。不幸的是,我们确定的版本并没有很好地加载测试。” —[Steve](https://redditblog.com/2007/10/18/reddit-status-update-part-ii/) + - “我快速修复了搜索,我希望有所帮助,直到我们有机会真正解决它。” —[Steve](https://redditblog.com/2007/06/08/a-note-on-search-and-what-were-working-on/) - very early version of Cassandra +- 2008 – David King ([u/ketralnis](https://www.reddit.com/user/ketralnis)), 第三名员工,现在是搜索工程师,实施Solr。 - (which was followed shortly thereafter by a memorable 24-hour, thundering-herd-related outage), and generally ignoring how bad search was getting. We had an intrepid startup approach us and offer to take search off of our hands + 实际上,他实现了一个自制的pysolr,它能够以XML格式将更新文档发送给Solr,并以这样的方式包装响应,以便模拟我们现有的Query模型,足以将其放入任何类型或列表中。它实际上很甜蜜。初始版本不支持评论,但后来确实如此。 - forever + - “[David]一直在修复Erlang的搜索和黑客攻击项目。” —[Alexis Ohanian](https://redditblog.com/2008/04/17/welcome-david/) + - “我完全取代了reddit搜索功能。” —[David King](https://redditblog.com/2008/04/21/new-search-2/) - for less than we were paying to keep Solr running, so we signed on! +- 2010 – David将Solr替换为第三方搜索提供商IndexTank。 - - “We launched a new search engine yesterday. Calm down. It’s okay. I know. You’ve been hurt before.” —[David King](https://redditblog.com/2010/07/21/new-search/) + 当你喜欢某些东西时,将其外包......从来没有人说过。随着网站持续增长,我们首先在一个月内与一个四人工程团队一起破解了十亿次网页浏览,我们将所有努力投入503缓解,继续添加Postgres读取,添加更多缓存,开始利用Cassandra的早期版本(之后很快就发生了一次令人难忘的停电),并且通常无视搜索的糟糕程度。我们有一个勇敢的决定,永远使用第三方搜索提供商,比我们为保持Solr运行所付出的更少,所以我们签了! -- 2012 – Keith Mitchell (u/kemitche) implements CloudSearch after LinkedIn shut down IndexTank. 
+ - “我们昨天推出了一个新的搜索引擎。冷静。没关系。我知道。你以前受伤了。” —[David King](https://redditblog.com/2010/07/21/new-search/) - Clearly, it was one of the +- 2012 – Keith Mitchell (u/kemitche) 在LinkedIn关闭IndexTank后实施CloudSearch。 - shorter + 很明显,这个永远过于短暂,IndexTank在公司被收购之前为我们提供了很好的帮助。当我们发现他们正在关闭时,我们不得不离开IndexTank并快速过渡到AWS CloudSearch。继续我们长期以来的传统“让新人照顾它”,这项任务落到了Keith身上,在接下来的几年里,我们将CloudSearch扩展到了爆炸状态: - forevers, but IndexTank served us well until the company was acquired. When we found out they were shutting down, we had to ween off of IndexTank and make a quick transition to AWS CloudSearch. Continuing our long-standing tradition of ‘Let the new guy take care of it,’ that task fell to Keith, and over the next several years we scaled and stretched CloudSearch to bursting: - - - “Today we moved from the old Amazon CloudSearch domain to a new Amazon CloudSearch domain. The old search domain had significant performance issues: roughly 33% of queries took over 5 seconds to complete and would result in the search error page.” —[u/bsimpson](https://www.reddit.com/r/changelog/comments/694o34/reddit_search_performance_improvements/) + - “今天,我们从旧的Amazon CloudSearch域迁移到新的Amazon CloudSearch域。旧的搜索域存在严重的性能问题:大约33%的查询需要5秒才能完成,并且会导致搜索错误页面。” —[u/bsimpson](https://www.reddit.com/r/changelog/comments/694o34/reddit_search_performance_improvements/) - TODAY – Lucidworks Fusion! - - - This time around, we wanted to ensure that search would meet three criteria: it needed to be fast, it needed to scale well with Reddit’s growth, and most importantly, it needed to be relevant. Ultimately, this led us to partner with the search experts at Lucidworks, leveraging Fusion and their unique search expertise from a team comprised of multiple Solr committers. Below, we’ll explain how we went about this in more detail. + 这一次,我们希望确保搜索符合三个标准:它需要快速,需要与Reddit的增长很好地扩展,最重要的是,它需要具有相关性。最终,这促使我们与Lucidworks的搜索专家合作,利用Fusion及其由多个Solr提交者组成的团队的独特搜索专业知识。下面,我们将更详细地解释我们如何进行此操作。 - - “As [/u/bitofsalt](https://www.reddit.com/u/bitofsalt) [mentioned a few months ago](https://www.reddit.com/r/funny/comments/65ryr3/and_now_a_look_at_the_machine_that_powers_reddits/dgd22mi/), we’ve been working on some improvements to search. We may even be ahead of [spez’s 10 year plan](https://www.reddit.com/r/announcements/comments/59k22p/hey_its_reddits_totally_politically_neutral_ceo/d992fwq/?context=1).” —[u/starfishjenga](https://www.reddit.com/r/changelog/comments/6pi0kk/improving_search/) + - “As [/u/bitofsalt](https://www.reddit.com/u/bitofsalt) [几个月前我们提到过](https://www.reddit.com/r/funny/comments/65ryr3/and_now_a_look_at_the_machine_that_powers_reddits/dgd22mi/), 我们一直在努力改进搜索。我们甚至可能领先于 [spez’s 的10年计划](https://www.reddit.com/r/announcements/comments/59k22p/hey_its_reddits_totally_politically_neutral_ceo/d992fwq/?context=1).” —[u/starfishjenga](https://www.reddit.com/r/changelog/comments/6pi0kk/improving_search/) -## **Once More with Feeling** +## **感受更多** -Earlier this year, search on Reddit had become truly abysmal. Simple queries could be expected to succeed only half of the time. Want to search with two keywords? Get out of here! +今年早些时候,对Reddit的搜索变得非常糟糕。简单的查询只能在一半的时间内成功。想要使用两个关键字进行搜索?离开这里! ![img](https://redditupvoted.files.wordpress.com/2017/09/screen-shot-2017-09-07-at-11-43-15-am.png?w=720&h=505) -Fig. 1: Example error page when our CloudSearch cluster is under heavy load. +图1:我们的CloudSearch集群负载过重时的示例错误页面。 -After looking at several options, we partnered with with [Lucidworks](https://lucidworks.com/) to revitalize Reddit’s search system. 
Lucidworks is the creator of Fusion, a Solr-based search stack that supports huge document scale and high query throughput. +在查看了几个选项后,我们与 [Lucidworks](https://lucidworks.com/) 合作,重振Reddit的搜索系统。 Lucidworks是Fusion的创建者,Fusion是一个基于Solr的搜索堆栈,支持巨大的文档规模和高查询吞吐量。 -## **First Things First: Ingesting at Reddit Scale** +## **第一件事:以Reddit量表摄取** -The biggest challenge in moving to a new search system was that our indexing pipeline needed to be updated. The first attempt was a bit of a beast. In the interest of speed, we hastily put it together on our legacy ETL system comprised of [Jenkins](https://jenkins.io/) and [Azkaban](https://azkaban.github.io/) orchestrating numerous Hive queries. As you can see in the diagram below, pulling together data from several sources into one cohesive canonical view to be indexed proved to be more complex than originally expected. +迁移到新搜索系统的最大挑战是我们的索引管道需要更新。第一次尝试很艰难。为了速度,我们匆忙将它放在我们由 [Jenkins](https://jenkins.io/)和[Azkaban](https://azkaban.github.io/) 组成的遗留ETL系统上,编排了许多Hive查询。正如您在下图中所看到的,将来自多个来源的数据汇总到一个有索引的规范视图中进行索引,证明比最初预期的更复杂。 ![img](https://redditupvoted.files.wordpress.com/2017/09/screen-shot-2017-09-07-at-11-44-37-am.png?w=720&h=433) -Fig. 2: First iteration at our new search ingestion pipeline, now replaced with a significantly simplified version. +图2:我们新的搜索提取管道的第一次迭代,现在被替换为显着简化的版本。 -Our second attempt was both simpler and produced significantly better results. We managed to trim the entire pipeline to just four simpler and more accurate Hive queries, which led to a 33% increase in posts indexed. Another great improvement is that we not only index new post creations but also update their relevance signals in real time as votes, comments, and other signals flow in throughout the day. +我们的第二次尝试既简单又产生了明显更好的结果。我们设法将整个管道修剪为仅仅四个更简单和更准确的Hive查询,这使得索引的帖子增加了33%。另一个重大改进是,我们不仅索引新的帖子创作,而且还实时更新其相关信号,因为投票,评论和其他信号全天都在流动。 -## **Make it Relevant** +## **使其相关** -Search results don’t mean much if they’re not relevant. For our initial rollout the primary goal was to avoid degrading the overall relevance of results returned. +如果它们不相关,搜索结果并不意味着什么。对于我们的初始部署,主要目标是避免降低返回结果的整体相关性。 -To monitor this, we measured clicks on the search results page and compared the rank of results being clicked across old and new search systems. A perfect search engine would yield 100% of clicks on the top result being returned, which is another way of saying you want the most relevant result at the top. Since we know a perfect search engine isn’t an achievable goal, we use measures like [Mean Reciprocal Rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and [Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) to compare the quality of our results. +为了监控这一点,我们测量了搜索结果页面上的点击次数,并比较了在新旧搜索系统中点击的结果的排名。一个完美的搜索引擎会在返回的最高结果上产生100%的点击次数,这是另一种表示您希望在顶部获得最相关结果的方式。由于我们知道完美的搜索引擎不是一个可实现的目标,我们使用 [平均互惠等级](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)和 [折扣累积增益](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)等措施来比较我们的结果质量。 -While it’s still early in our experiments, the data so far points towards very comparable relevancy measurements between our old vs. new stacks, with Fusion having a slight edge. The promising part of this is that we haven’t done much relevancy tuning yet — something that our new system actually supports. Advancements like personalization, machine learning models, and query intent and rewriting are now low-hanging fruit. 
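+作为参考,这两个指标的大致算法如下(一个极简的示意,变量名为假设,Reddit实际的统计口径未必与此完全相同):

+```python
+import math
+
+def mean_reciprocal_rank(click_ranks):
+    """click_ranks:每次搜索中被点击结果的排名(从1开始计)。"""
+    return sum(1.0 / r for r in click_ranks) / len(click_ranks)
+
+def dcg(relevances):
+    """relevances:按返回顺序排列的相关性得分。"""
+    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))
+
+# 例如:三次搜索中,被点击的结果分别排在第1、第3和第2位
+print(mean_reciprocal_rank([1, 3, 2]))  # ≈ 0.61
+```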
+虽然它在我们的实验中还处于早期阶段,但迄今为止的数据指向了我们的旧堆栈与新堆栈之间非常可比的相关性测量,而Fusion具有轻微的优势。这个有希望的部分是我们还没有进行太多的相关调整 - 这是我们的新系统实际支持的东西。个性化,机器学习模型以及查询意图和重写等进步现在已经成为现实。 ![img](https://redditupvoted.files.wordpress.com/2017/09/screen-shot-2017-09-07-at-11-46-10-am.png?w=720&h=276) -Fig. 3: Comparison of search result click positions between Fusion and CloudSearch stacks. +图3:Fusion和CloudSearch堆栈之间搜索结果点击位置的比较。 -## **The Rollout** +## **首次展示** -As we overcame the data ingestion challenges and monitored relevance, we continued to ramp up usage to more and more redditors. The [feedback](https://www.reddit.com/r/changelog/comments/6pi0kk/improving_search/) from this early group was invaluable, and we owe the community a huge thank-you in helping us surface bugs and less common use cases. We started out with just 1% of users on the new stack, working through issues reported and improving the ingestion pipeline as we increased rollout percentages to 5, 10, 25 and ultimately 50% of traffic prior to GA. Throughout this time, we sent all search queries as dark traffic to our new search cluster to ensure it would be ready for full scale as we increased rollout percentages. +在我们克服数据提取挑战和监控相关性时,我们继续将使用率提高到越来越多的redditors。这个早期小组的 [反馈](https://www.reddit.com/r/changelog/comments/6pi0kk/improving_search/)非常宝贵,我们非常感谢社区帮助我们解决漏洞和不太常见的用例。我们在新筹码上只有1%的用户开始工作,处理报告的问题并改进了摄取管道,因为我们在GA之前将推出百分比提高到5,10,25和最终50%的流量。在这段时间里,我们将所有搜索查询作为黑暗流量发送到我们的新搜索群集,以确保随着我们增加推出百分比,它可以全面扩展。 ![img](https://redditupvoted.files.wordpress.com/2017/09/screen-shot-2017-09-07-at-11-47-24-am.png?w=720&h=376) -Fig. 4: CloudSearch Errors in yellow and Fusion in green. +图4:黄色的CloudSearch错误和绿色的Fusion。 -We’re proud to say that Reddit Search is better than ever! A full reindex of all Reddit content now completes in about 5 hours (down from around 11 hours), and we’re constantly streaming live updates to the index. The error rate is down by two orders of magnitude with 99% of search results served in under 500ms. The number of machines needed to run search dropped from ~200 earlier this year down to ~30 so we even managed to get some cost savings. +我们很自豪地说Reddit Search比以往更好!所有Reddit内容的完整重新索引现在在大约5个小时内完成(从大约11个小时开始),我们不断将实时更新流式传输到索引。错误率下降了两个数量级,99%的搜索结果在500毫秒内完成。运行搜索所需的机器数量从今年早些时候的约200台减少到30台左右,因此我们甚至设法节省了一些成本。 ![img](https://redditupvoted.files.wordpress.com/2017/09/screen-shot-2017-09-07-at-11-48-12-am.png?w=720&h=308) -Fig. 5: Overview of Reddit’s new search stack. +图5:Reddit新搜索堆栈概述。 -Faster, more reliable, more relevant, and lower cost! Certainly this shall be the last time we ever need to change our search stack! +更快,更可靠,更相关,更低成本!当然这应该是我们最后一次需要更改搜索堆栈! -## **The Future** +## **展望未来** -In all seriousness, we think you’ll love this update. It’s our hope that the new search stack will be a foundation for improvements that make it easier to discover all the great content on Reddit. More importantly: we’re not done. Fixing search is just the first step in a series of new capabilities that will make Reddit feel more personalized and relevant to your interests. Reddit finally has a Search & Relevance team, and we are hiring like crazy. 
If you’re excited about working with one of the world’s most interesting datasets on a search and relevance platform used by hundreds of millions of people, then check out our job listings: +严肃地说,我们认为你会喜欢这个更新。我们希望新的搜索堆栈将成为改进的基础,以便更容易地发现Reddit上的所有优秀内容。更重要的是:我们没有完成。修复搜索只是一系列新功能的第一步,这些功能将使Reddit更加个性化并与您的兴趣相关。 Reddit最终拥有一个Search&Relevance团队,我们正在疯狂招聘。如果您对在数亿人使用的搜索和相关性平台上使用世界上最有趣的数据集之一感到兴奋,那么请查看我们的工作列表: **Head of Search:** [https://boards.greenhouse.io/reddit/jobs/723000#.Wa3yONOGOEI ](https://boards.greenhouse.io/reddit/jobs/723000#.Wa3yONOGOEI)**Head of Relevance:** [https://boards.greenhouse.io/reddit/jobs/611466#.WbC_ltOGOEI ](https://boards.greenhouse.io/reddit/jobs/611466#.WbC_ltOGOEI)**Head of Discovery:** [https://boards.greenhouse.io/reddit/jobs/764831#.WbC_2NOGOEI ](https://boards.greenhouse.io/reddit/jobs/764831#.WbC_2NOGOEI)**Search Engineers:** -Finally, thanks to the Lucidworks team for an amazing partnership and helping us end the search for better search at Reddit. \ No newline at end of file +最后,感谢Lucidworks团队提供了一个惊人的合作伙伴关系,并帮助我们在Reddit上寻找更好的搜索。 \ No newline at end of file