全部产品
存储与CDN 数据库 安全 应用服务 数加·人工智能 数加·大数据基础服务 互联网中间件 视频服务 开发者工具 解决方案 物联网
E-MapReduce

Hadoop Streaming

更新时间:2017-06-07 13:26:11

python 写hadoop streaming作业

mapper代码如下

  1. #!/usr/bin/env python
  2. import sys
  3. for line in sys.stdin:
  4. line = line.strip()
  5. words = line.split()
  6. for word in words:
  7. print '%s\t%s' % (word, 1)

reducer代码如下

  1. #!/usr/bin/env python
  2. from operator import itemgetter
  3. import sys
  4. current_word = None
  5. current_count = 0
  6. word = None
  7. for line in sys.stdin:
  8. line = line.strip()
  9. word, count = line.split('\t', 1)
  10. try:
  11. count = int(count)
  12. except ValueError:
  13. continue
  14. if current_word == word:
  15. current_count += count
  16. else:
  17. if current_word:
  18. print '%s\t%s' % (current_word, current_count)
  19. current_count = count
  20. current_word = word
  21. if current_word == word:
  22. print '%s\t%s' % (current_word, current_count)

假设mapper代码保存在/home/hadoop/mapper.py, reducer代码保存在/home/hadoop/reducer.py , 输入路径为hdfs文件系统的/tmp/input,输出路径为hdfs文件系统的/tmp/output。则在E-MapReduce集群上提交下面的hadoop命令

hadoop jar /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-*.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input /tmp/hosts -output /tmp/output

本文导读目录