使用Python提交Hadoop Streaming作业-开源大数据平台 E-MapReduce-阿里云

本文为您介绍如何使用Python提交Hadoop Streaming作业。

前提条件

已在E-MapReduce控制台上创建Hadoop集群。

创建集群详情，请参见创建集群。

操作步骤

通过SSH方式连接集群，详情请参见使用SSH连接主节点。
新建文件mapper.py。
1. 执行以下命令，打开文件mapper.py。
```
vim /home/hadoop/mapper.py
```
2. 按下i键进入编辑模式。
3. 在mapper.py文件中添加以下信息。
```
#!/usr/bin/env python
import sys
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
```
4. 按下Esc键退出编辑模式，输入:wq保存并关闭文件。

新建文件reducer.py。

执行以下命令，打开文件reducer.py。
```
vim /home/hadoop/reducer.py
```
按下i键进入编辑模式。

在reducer.py文件中添加以下信息。

#!/usr/bin/env python
from operator import itemgetter
import sys
current_word = None
current_count = 0
word = None
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

按下Esc键退出编辑模式，输入:wq保存并关闭文件。

执行以下命令，上传hosts文件到HDFS。
```
hdfs dfs -put /etc/hosts /tmp/
```

执行以下命令，提交Hadoop Streaming作业。

hadoop jar /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-X.X.X.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input /tmp/hosts -output /tmp/output


参数	描述
input	输入路径，本示例为/tmp/hosts。
output	输出路径，本示例为/tmp/output。

说明 hadoop-streaming-X.X.X.jar中的X.X.X表示JAR包的具体版本号，需要根据实际集群中Hadoop的版本来修改。您可以在/usr/lib/hadoop-current/share/hadoop/tools/lib/目录下查看JAR包具体版本号。