Original article: [MAPREDUCE PATTERNS, ALGORITHMS, AND USE CASES](https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/)
```
class Mapper
   method Map(docid id, doc d)
      H = new AssociativeArray
      for all term t in doc d do
         H{t} = H{t} + 1
      for all term t in H do
         Emit(term t, count H{t})
```
To let a Mapper node accumulate counts not only within a single document but across all the documents it processes, we can use a Combiner:
```
class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Combiner
   method Combine(term t, [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
         sum = sum + c
      Emit(term t, count sum)
```
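As a sanity check, the whole map/combine/shuffle/reduce pipeline above can be simulated in a few lines of Python (the function names and the driver loop are illustrative, not part of any MapReduce API):

```python
from collections import Counter, defaultdict

def map_doc(doc):
    # Mapper: emit (term, 1) for every term occurrence
    return [(t, 1) for t in doc.split()]

def combine(pairs):
    # Combiner: pre-aggregate the counts emitted by one mapper
    h = Counter()
    for term, c in pairs:
        h[term] += c
    return list(h.items())

def reduce_counts(grouped):
    # Reducer: sum the partial counts for each term
    return {term: sum(counts) for term, counts in grouped.items()}

docs = ["a b a", "b c"]
shuffled = defaultdict(list)
for doc in docs:                      # each doc plays the role of one mapper's input
    for term, c in combine(map_doc(doc)):
        shuffled[term].append(c)      # shuffle phase: group partial counts by key
print(reduce_counts(shuffled))        # {'a': 2, 'b': 2, 'c': 1}
```

The combiner cuts shuffle traffic: for `"a b a"` it sends `('a', 2)` once instead of two `('a', 1)` pairs.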
**Problem Statement:** There is a set of items and some function of one item. It is required to save all items that have the same value of this function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is building inverted indexes.
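A compact sketch of the inverted-index case in Python, collapsing the map/shuffle/reduce phases into one in-memory pass (the function and variable names are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Mapper emits (term, docid); the shuffle groups pairs by term,
    # and the reducer writes out each group as a posting list.
    index = defaultdict(set)
    for docid, text in docs.items():
        for term in text.split():
            index[term].add(docid)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "big data", 2: "big ideas"}
print(build_inverted_index(docs))  # {'big': [1, 2], 'data': [1], 'ideas': [2]}
```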
```
class Mapper
   method Map(id n, object N)
      Emit(id n, object N)
      for all id m in N.OutgoingRelations do
         Emit(id m, message getMessage(N))

class Reducer
   method Reduce(id m, [s1, s2,...])
      M = null
      messages = []
      for all s in [s1, s2,...] do
         if IsObject(s) then
            M = s
         else                    // s is a message
            messages.add(s)
      M.State = calculateState(messages)
      Emit(id m, item M)
```
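One iteration of this message-passing pattern can be simulated in plain Python. Here `getMessage(N)` is assumed to return the node's state, and `calculateState` is assumed to be `max` over the received messages and the node's own state, purely for illustration:

```python
from collections import defaultdict

# Toy graph: each node carries a numeric state and a list of outgoing edges.
nodes = {"a": {"state": 1, "out": ["b"]},
         "b": {"state": 5, "out": ["c"]},
         "c": {"state": 2, "out": []}}

def iterate(nodes):
    inbox = defaultdict(list)
    for nid, n in nodes.items():          # map: emit the node plus its messages
        for m in n["out"]:
            inbox[m].append(n["state"])   # getMessage(N) == N.state (assumption)
    new = {}
    for nid, n in nodes.items():          # reduce: recombine each node with its messages
        msgs = inbox.get(nid, [])
        state = max(msgs + [n["state"]])  # calculateState (assumption: max)
        new[nid] = {"state": state, "out": n["out"]}
    return new

print({k: v["state"] for k, v in iterate(nodes).items()})  # {'a': 1, 'b': 5, 'c': 5}
```

Running `iterate` repeatedly until nothing changes mirrors the iterative MapReduce jobs this pattern is used for.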
1. Lin J., Dyer C. [Data-Intensive Text Processing with MapReduce](http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421/)
```
class Mapper
   method Map(rowkey key, tuple t)
      tuple g = project(t)          // extract required fields to tuple g
      Emit(tuple g, null)

class Reducer
   method Reduce(tuple t, array n)  // n is an array of nulls
      Emit(tuple t, null)
```
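A small Python sketch of this selection/projection pattern, where grouping by the projected tuple (the shuffle's job) is what removes duplicates (names are illustrative):

```python
def project_distinct(rows, fields):
    # Mapper projects each tuple to the required fields and emits it as a key;
    # the reducer emits each distinct key once, since duplicates collapse in the shuffle.
    seen = {tuple(row[f] for f in fields) for row in rows}
    return sorted(seen)

rows = [{"user": "u1", "city": "NY"}, {"user": "u2", "city": "NY"}]
print(project_distinct(rows, ["city"]))  # [('NY',)]
```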
## Union
The Mappers emit all records from both data sets; the Reducer eliminates duplicates:
```
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)

class Reducer
   method Reduce(tuple t, array n)  // n is an array of one or two nulls
      Emit(tuple t, null)
```
## Intersection
The Mappers emit all records from both sets; the Reducer emits only the tuples that occurred in both:
```
class Mapper
   method Map(rowkey key, tuple t)
      Emit(tuple t, null)

class Reducer
   method Reduce(tuple t, array n)  // n is an array of one or two nulls
      if n.size() = 2
         Emit(tuple t, null)
```
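Union, intersection, and (by the same tagging trick) set difference can be sketched together in Python: tagging each tuple with its source set mirrors what the shuffle phase does when grouping the emitted keys (the `set_ops` helper is hypothetical):

```python
from collections import defaultdict

def set_ops(r, s):
    # Tag each tuple with its source set, group by tuple, then inspect the tags:
    # union = any tag present, intersection = both tags, difference = only 'R'.
    tags = defaultdict(set)
    for t in r:
        tags[t].add("R")
    for t in s:
        tags[t].add("S")
    union = sorted(tags)
    inter = sorted(t for t, g in tags.items() if g == {"R", "S"})
    diff = sorted(t for t, g in tags.items() if g == {"R"})
    return union, inter, diff

print(set_ops({1, 2, 3}, {2, 3, 4}))  # ([1, 2, 3, 4], [2, 3], [1])
```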
```
class Mapper
   method Map(null, tuple [join_key k, value v1, value v2,...])
      Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2, ...]])

class Reducer
   method Reduce(join_key k, tagged_tuples [t1, t2,...])
      H = new AssociativeArray : set_name -> values
      for all tagged_tuple t in [t1, t2,...] do  // separate values into 2 arrays
         H{t.tag}.add(t.values)
      for all values r in H{'R'} do              // produce a cross-join of the two arrays
         for all values l in H{'L'} do
            Emit(null, [k r l])
```
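A single-process Python sketch of this repartition (reduce-side) join; the `buckets` dictionary stands in for the shuffle's grouping by join key, and `'L'`/`'R'` tag the two relations as in the pseudocode:

```python
from collections import defaultdict

def repartition_join(left, right):
    # Map phase: tag each (key, value) tuple with its relation and group by join key.
    buckets = defaultdict(lambda: {"L": [], "R": []})
    for k, v in left:
        buckets[k]["L"].append(v)
    for k, v in right:
        buckets[k]["R"].append(v)
    # Reduce phase: cross-join the two tag groups for every key.
    out = []
    for k, h in buckets.items():
        for l in h["L"]:
            for r in h["R"]:
                out.append((k, l, r))
    return out

left = [(1, "a"), (2, "b")]
right = [(1, "x"), (1, "y")]
print(repartition_join(left, right))  # [(1, 'a', 'x'), (1, 'a', 'y')]
```

Key 2 appears only on the left, so it contributes nothing, exactly as an inner join should behave.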
1. [Join Algorithms using Map/Reduce](http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf)
2. [Optimizing Joins in a MapReduce Environment](http://infolab.stanford.edu/~ullman/pub/join-mr.pdf)
# MapReduce Algorithms for Machine Learning and Mathematics
- C. T. Chu *et al.* provide an excellent description of machine learning algorithms for MapReduce in the article [Map-Reduce for Machine Learning on Multicore](http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf).
- FFT using MapReduce: <http://www.slideshare.net/hortonworks/large-scale-math-with-hadoop-mapreduce>
- MapReduce for integer factorization: <http://www.javiertordable.com/files/MapreduceForIntegerFactorization.pdf>
- Matrix multiplication with MapReduce: <http://csl.skku.edu/papers/CS-TR-2010-330.pdf> and <http://www.norstad.org/matrix-multiply/index.html>
Original article: [DISTRIBUTED ALGORITHMS IN NOSQL DATABASES](https://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/)
- (H, Transactional Read Quorum Write Quorum and Read One Write All) Write conflicts under quorum replication can also be avoided with transactional techniques. The well-known approach is the two-phase commit protocol, but two-phase commit is not completely reliable because a coordinator failure can leave resources blocked. The PAXOS commit protocol [20, 21] is a more reliable alternative at the cost of some performance. A small step further brings us to the Read One Write All approach, which puts the updates of all replicas into one transaction; it provides strong fault-tolerant consistency but sacrifices some performance and availability.
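The quorum variants discussed above all reduce to one overlap condition: a read quorum of R replicas is guaranteed to intersect the latest write quorum of W replicas whenever R + W > N. A tiny illustrative check (the helper name is invented):

```python
def quorum_is_consistent(n, r, w):
    # Every read of r replicas overlaps every write of w replicas
    # out of n total replicas exactly when r + w > n.
    return r + w > n

# Read One Write All: R = 1, W = N -> consistent
print(quorum_is_consistent(3, 1, 3))  # True
# Majority quorums: R = W = 2 of N = 3 -> consistent
print(quorum_is_consistent(3, 2, 2))  # True
# R = W = 1 of N = 3 -> a read may miss the latest write
print(quorum_is_consistent(3, 1, 1))  # False
```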
- **Consistency-scalability tradeoff.** As seen above, even though read/write consistency guarantees severely limit the scalability of a replica set, write conflicts can still be resolved in a relatively scalable way in the atomic-write model. The atomic read-modify-write model avoids conflicts by placing a short-lived global lock on the data. This shows that *dependencies between data or operations, even ones small in scope or short in duration, harm scalability*. Hence [careful data modeling](https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/) and storing sharded data separately are very important for scalability.
Each data item is a user record with three attributes: First Name, Last Name, and Phone Number. These attributes form a three-dimensional space, and one possible data placement strategy is to map each octant of that space to a physical node. A query like "First Name = John" corresponds to a plane that intersects four octants, so only four nodes are involved in the query. A query that fixes two attributes corresponds to a line that intersects two octants (as shown in the figure above), so only two nodes are involved.
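A toy sketch of this hyperspace-hashing idea in Python: each attribute is hashed to one bit, so three attributes address 2×2×2 = 8 subspaces, and fixing one attribute leaves four candidate subspaces (the helper names are invented; this is not the HyperDex API):

```python
from hashlib import md5

def coord(value, bits=1):
    # Hash one attribute value to a small coordinate (1 bit -> 2 halves per axis).
    return int(md5(value.encode()).hexdigest(), 16) % (2 ** bits)

def node_for(first, last, phone):
    # Three attributes -> a point in a 2x2x2 = 8-subspace cube,
    # each subspace mapped to one physical node (illustrative only).
    return (coord(first), coord(last), coord(phone))

record_node = node_for("John", "Smith", "555-0100")

# A query that fixes only First Name = "John" must touch every node whose
# first coordinate matches; the other two axes stay free -> 4 candidate nodes.
candidates = [(coord("John"), y, z) for y in (0, 1) for z in (0, 1)]
print(len(candidates))  # 4
```

Whatever the hash values come out to, the record's node is always among the four candidates, because the free axes enumerate all remaining coordinates.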
1. [M. Shapiro et al. A Comprehensive Study of Convergent and Commutative Replicated Data Types](http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf)
2. [I. Stoica et al. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications](http://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf)
3. [R. J. Honicky, E. L. Miller. Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution](http://www.ssrc.ucsc.edu/Papers/honicky-ipdps04.pdf)
4. [G. Shah. Distributed Data Structures for Peer-to-Peer Systems](http://cs-www.cs.yale.edu/homes/shah/pubs/thesis.pdf)
5. [A. Montresor. Gossip Protocols for Large-Scale Distributed Systems](http://sbrc2010.inf.ufrgs.br/resources/presentations/tutorial/tutorial-montresor.pdf)
6. [R. Escriva, B. Wong, E. G. Sirer. HyperDex: A Distributed, Searchable Key-Value Store](http://hyperdex.org/papers/hyperdex.pdf)
7. [A. Demers et al. Epidemic Algorithms for Replicated Database Maintenance](http://net.pku.edu.cn/~course/cs501/2009/reading/1987-SPDC-Epidemic%20algorithms%20for%20replicated%20database%20maintenance.pdf)
8. [G. DeCandia et al. Dynamo: Amazon’s Highly Available Key-value Store](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf)
9. [R. van Renesse et al. Efficient Reconciliation and Flow Control for Anti-Entropy Protocols](http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf)
10. [S. Ranganathan et al. Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters](http://www.hcs.ufl.edu/pubs/CC2000.pdf)
12. [N. Hayashibara, X. Defago, R. Yared, T. Katayama. The Phi Accrual Failure Detector](http://cassandra-shawn.googlecode.com/files/The%20Phi%20Accrual%20Failure%20Detector.pdf)
13. [M. J. Fischer, N. A. Lynch, M. S. Paterson. Impossibility of Distributed Consensus with One Faulty Process](http://www.cs.mcgill.ca/~carl/impossible.pdf)
14. [N. Hayashibara, A. Cherif, T. Katayama. Failure Detectors for Large-Scale Distributed Systems](http://ddg.jaist.ac.jp/pub/HCK02.pdf)
15. M. Leslie, J. Davies, T. Huffman. A Comparison Of Replication Strategies for Reliable Decentralised Storage
16. [A. Lakshman, P. Malik. Cassandra – A Decentralized Structured Storage System](http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf)
17. N. A. Lynch. Distributed Algorithms
18. G. Tel. Introduction to Distributed Algorithms
23. [J. C. Corbett et al. Spanner: Google’s Globally-Distributed Database](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf)