Hadoop programming: the complete formula
                   input key     input value       output key              output value
Mapper           <     A      ,      B         ,       C                ,       D         >
map              (     A      ,      B         , OutputCollector<C, D> , Reporter reporter )
output.collect   (     c      ,      d         )
Reducer          <     C      ,      D         ,       E                ,       F         >
reduce           (     C      ,   Iterator<D>  , OutputCollector<E, F> , Reporter reporter )
output.collect   (     e      ,      f         )
- A, B, C, D, E, F stand for the classes you may use; c, d, e, f are objects of the classes C, D, E, F.
- With this table in hand, planning an M/R program goes like this:
- First decide which classes the Map input <key, value> should belong to; that fixes A and B.
- Decide the Map output <key, value>; that settles C and D.
- Then think about which classes the final output <key, value> should be; that determines E and F.
- Once A through F are filled in, the skeleton of the whole program is there; the rest is how you implement it.
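The planning steps above can be sketched in plain Java. `SimpleMapper` and `SimpleReducer` below are hypothetical stand-ins for Hadoop's interfaces, introduced only so the sketch runs without a cluster; the point is that once A–F are chosen, the shape of the whole program is fixed:

```java
import java.util.List;
import java.util.function.BiConsumer;

public class Formula {
    // Hypothetical stand-ins for Hadoop's Mapper/Reducer interfaces.
    // The four type parameters play the roles of A, B, C, D (and C, D, E, F).
    interface SimpleMapper<A, B, C, D> {
        void map(A key, B value, BiConsumer<C, D> output);   // output ~ OutputCollector<C, D>
    }
    interface SimpleReducer<C, D, E, F> {
        void reduce(C key, List<D> values, BiConsumer<E, F> output);
    }

    // WordCount's choices: A = Long, B = String, C = String, D = Integer, E = String, F = Integer
    public static final SimpleMapper<Long, String, String, Integer> MAP =
        (key, line, out) -> { for (String w : line.split("\\s+")) out.accept(w, 1); };
    public static final SimpleReducer<String, Integer, String, Integer> REDUCE =
        (word, counts, out) -> out.accept(word, counts.stream().mapToInt(Integer::intValue).sum());

    public static void main(String[] args) {
        MAP.map(0L, "hello hadoop hello", (w, n) -> System.out.println("map emits: " + w + " " + n));
        REDUCE.reduce("hello", List.of(1, 1), (w, n) -> System.out.println("reduce emits: " + w + " " + n));
    }
}
```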
- Example: <WordCount>
- The input is an HDFS path, so by the time it reaches the mapper, key = the offset of each line within the file and value = each line of the file as a string. A can therefore be any class, and B is Text.
- What WordCount must ultimately compute is the number of occurrences of each word, so the final output <key, value> should be (word, number): E = Text, F = IntWritable.
- To get from raw lines of text to <word, number>, the logic is: first extract every word from each line, then set the map output to <word, 1> so the Reduce phase can add them up. Hence C = Text, D = IntWritable.
- Once A through F are filled in, all that is left is the program logic!
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: A = LongWritable (line offset), B = Text (the line), C = Text, D = IntWritable
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);              // emit <word, 1>
            }
        }
    }

    // Reducer: C = Text, D = IntWritable, E = Text, F = IntWritable
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();             // add up the 1s for this word
            }
            output.collect(key, new IntWritable(sum));  // emit <word, total count>
        }
    }
}
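The Map and Reduce classes only make sense together with the shuffle phase that Hadoop runs between them. To see the data flow without a cluster, here is a minimal plain-Java simulation of map → group-by-key → reduce; `simulate` is a hypothetical helper written for this sketch, not a Hadoop API:

```java
import java.util.*;

public class WordCountSim {
    // Simulates map(): emit one (word, 1) pair per token, like output.collect(word, one)
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (StringTokenizer t = new StringTokenizer(line); t.hasMoreTokens(); )
            out.add(Map.entry(t.nextToken(), 1));
        return out;
    }

    // Simulates shuffle + reduce(): group the values by key, then sum each group
    static Map<String, Integer> simulate(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();   // shuffle: group by key
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;                                        // reduce: sum the 1s
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(simulate(List.of("hello hadoop", "hello world")));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```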