Saturday, August 1, 2015

Hadoop: MapReduce WordCount Example


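Continuing from the previous post, "Hadoop: How to Install Hadoop 2.7.1 (Single-Node Cluster) on CentOS 7.1.1503", this post records a MapReduce example.

The example below comes from the first three reference documents. This post only records the steps I actually ran, as a reference for later installations.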


1. Downloading the MapReduce sample data
2. Uploading the sample data to Hadoop
3. Verifying that the sample data was uploaded to Hadoop
4. Using WordCount as the data analysis example

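1. Downloading the MapReduce sample data

The author of the first three reference documents provides three URLs for downloading the sample data (choose the Plain Text UTF-8 files). Download the files into the samples/gutenberg subdirectory.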

The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
The Notebooks of Leonardo Da Vinci
Ulysses by James Joyce

[hadoop@localhost ~]$ mkdir -p samples/gutenberg
[hadoop@localhost ~]$ cd samples/gutenberg

[hadoop@localhost samples]$ wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
--2015-07-29 10:47:06--  http://www.gutenberg.org/cache/epub/20417/pg20417.txt
Resolving www.gutenberg.org (www.gutenberg.org)...
Connecting to www.gutenberg.org (www.gutenberg.org)||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 674570 (659K) [text/plain]
Saving to: ‘pg20417.txt’

100%[======================================>] 674,570     31.2KB/s   in 28s    

2015-07-29 10:47:35 (23.2 KB/s) - ‘pg20417.txt’ saved [674570/674570]

[hadoop@localhost samples]$ wget http://www.gutenberg.org/files/5000/5000-8.txt
--2015-07-29 10:48:38--  http://www.gutenberg.org/files/5000/5000-8.txt
Resolving www.gutenberg.org (www.gutenberg.org)...
Connecting to www.gutenberg.org (www.gutenberg.org)||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1428841 (1.4M) [text/plain]
Saving to: ‘5000-8.txt’

100%[======================================>] 1,428,841   55.1KB/s   in 15s    

2015-07-29 10:48:53 (95.2 KB/s) - ‘5000-8.txt’ saved [1428841/1428841]

[hadoop@localhost samples]$ wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
--2015-07-29 10:49:40--  http://www.gutenberg.org/cache/epub/4300/pg4300.txt
Resolving www.gutenberg.org (www.gutenberg.org)...
Connecting to www.gutenberg.org (www.gutenberg.org)||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1573151 (1.5M) [text/plain]
Saving to: ‘pg4300.txt’

100%[======================================>] 1,573,151   46.3KB/s   in 38s    

2015-07-29 10:50:18 (40.5 KB/s) - ‘pg4300.txt’ saved [1573151/1573151]

[hadoop@localhost samples]$ 
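
The three downloads above could also be scripted as one loop. The following is a minimal sketch, assuming the three URLs from the transcript are still served (not verified here). The names `SAMPLE_URLS` and `fetch_samples` are hypothetical and not part of the original steps.

```shell
# URLs copied from the wget commands in the transcript above.
SAMPLE_URLS="
http://www.gutenberg.org/cache/epub/20417/pg20417.txt
http://www.gutenberg.org/files/5000/5000-8.txt
http://www.gutenberg.org/cache/epub/4300/pg4300.txt
"

# fetch_samples [DEST]: download every sample file into DEST
# (hypothetical helper; DEST defaults to $HOME/samples/gutenberg).
fetch_samples() {
  dest="${1:-$HOME/samples/gutenberg}"
  mkdir -p "$dest" || return 1
  # The URL list is deliberately unquoted so it splits on newlines.
  for url in $SAMPLE_URLS; do
    # -nc skips files that already exist; -P sets the target directory.
    wget -nc -P "$dest" "$url" || return 1
  done
}
```

Calling `fetch_samples` with no argument would place the files where the later `-put` command expects them.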

2. Uploading the sample data to Hadoop

  • Create the HDFS directories needed to run MapReduce jobs
[hadoop@localhost hadoop]$ echo $HOME
[hadoop@localhost hadoop]$ cd $HADOOP_HOME
[hadoop@localhost hadoop]$ pwd
[hadoop@localhost hadoop]$ bin/hdfs dfs -mkdir /user    # Note: this directory is created inside HDFS itself, not on the Linux filesystem
[hadoop@localhost hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@localhost hadoop]$ bin/hdfs dfs -mkdir /user/hadoop/gutenberg

  • Copy the files into the HDFS directory
[hadoop@localhost hadoop]$ bin/hadoop fs -put $HOME/samples/gutenberg/*.txt /user/hadoop/gutenberg
[hadoop@localhost hadoop]$
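
The directory creation and upload steps above could be combined into one helper. This is only a sketch: it relies on the `-p` option of `hdfs dfs -mkdir` to create the parent directories in one call. The function name `upload_samples` and the `HDFS_CMD` override variable are hypothetical, and the default command path assumes `$HADOOP_HOME` is set as in the transcript.

```shell
# upload_samples [SRC_DIR] [HDFS_DIR]: create HDFS_DIR (with parents) and
# copy every .txt file from the local SRC_DIR into it (hypothetical helper).
upload_samples() {
  src="${1:-$HOME/samples/gutenberg}"
  dst="${2:-/user/hadoop/gutenberg}"
  # HDFS_CMD is an assumed override hook; by default use the hdfs launcher
  # under $HADOOP_HOME, as the transcript does.
  hdfs_cmd="${HDFS_CMD:-$HADOOP_HOME/bin/hdfs}"
  # -p creates /user, /user/hadoop and /user/hadoop/gutenberg in one step.
  "$hdfs_cmd" dfs -mkdir -p "$dst" || return 1
  # The glob is expanded by the local shell before hdfs sees it.
  "$hdfs_cmd" dfs -put "$src"/*.txt "$dst"
}
```

With the defaults, `upload_samples` would target the same `/user/hadoop/gutenberg` directory used in the rest of this post.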

3. Verifying that the sample data was uploaded to Hadoop

  • List the HDFS directories and confirm that the three files are present
[hadoop@localhost /]$ cd $HADOOP_HOME
[hadoop@localhost hadoop]$ bin/hadoop fs -ls /user/hadoop
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2015-07-29 11:25 /user/hadoop/gutenberg
[hadoop@localhost hadoop]$ bin/hadoop fs -ls /user/hadoop/gutenberg
Found 3 items
-rw-r--r--   1 hadoop supergroup    1428841 2015-07-29 11:25 /user/hadoop/gutenberg/5000-8.txt
-rw-r--r--   1 hadoop supergroup     674570 2015-07-29 11:25 /user/hadoop/gutenberg/pg20417.txt
-rw-r--r--   1 hadoop supergroup    1573151 2015-07-29 11:25 /user/hadoop/gutenberg/pg4300.txt
[hadoop@localhost hadoop]$ 
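
The sizes in the listing above (1428841, 674570 and 1573151 bytes) match the lengths wget reported, which is a quick sign that the upload was complete. The comparison could be automated along these lines. This is a sketch that assumes the `-ls` column layout shown above (size in the fifth column, path in the last) and file names without spaces; `hdfs_ls_sizes` and `local_sizes` are hypothetical helper names.

```shell
# Read `hadoop fs -ls` output on stdin and print "name<TAB>size" for each
# regular file (lines whose permission string starts with "-").
hdfs_ls_sizes() {
  awk '$1 ~ /^-/ {
    n = split($NF, parts, "/")          # last field is the full HDFS path
    printf "%s\t%s\n", parts[n], $5     # fifth field is the size in bytes
  }'
}

# Print "name<TAB>size" for every .txt file in the local directory $1.
local_sizes() {
  for f in "$1"/*.txt; do
    printf '%s\t%s\n' "$(basename "$f")" "$(wc -c < "$f" | tr -d ' ')"
  done
}

# Example (requires a running HDFS, so it is left commented out):
#   bin/hadoop fs -ls /user/hadoop/gutenberg | hdfs_ls_sizes | sort > /tmp/hdfs.tsv
#   local_sizes "$HOME/samples/gutenberg" | sort > /tmp/local.tsv
#   diff /tmp/local.tsv /tmp/hdfs.tsv && echo "sizes match"
```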

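4. Using WordCount as the data analysis example

  • Run the MapReduce program. The first attempt fails because the examples jar is not directly under $HADOOP_HOME; find locates it under share/hadoop/mapreduce.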
[hadoop@localhost hadoop]$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-output
Not a valid JAR: /usr/local/hadoop/hadoop*examples*.jar
[hadoop@localhost hadoop]$ find . -name '*examples*.jar' -print
[hadoop@localhost hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop*examples*.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-output
15/07/29 11:41:33 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/07/29 11:41:33 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/07/29 11:41:35 INFO input.FileInputFormat: Total input paths to process : 3
15/07/29 11:41:35 INFO mapreduce.JobSubmitter: number of splits:3
15/07/29 11:41:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local170720327_0001
15/07/29 11:41:37 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/07/29 11:41:37 INFO mapreduce.Job: Running job: job_local170720327_0001
15/07/29 11:41:37 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/07/29 11:41:37 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/07/29 11:41:37 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/07/29 11:41:38 INFO mapreduce.Job: Job job_local170720327_0001 running in uber mode : false
15/07/29 11:41:38 INFO mapreduce.Job:  map 0% reduce 0%
15/07/29 11:41:38 INFO mapred.LocalJobRunner: Waiting for map tasks
15/07/29 11:41:38 INFO mapred.LocalJobRunner: Starting task: attempt_local170720327_0001_m
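
To make clear what the job computes: the bundled wordcount example splits each line on whitespace and sums a count of 1 for every token, so punctuation stays attached to the word. The same computation can be approximated locally to cross-check results on small inputs. This is a rough sketch in awk rather than the Hadoop implementation, and `wordcount_local` is a hypothetical name.

```shell
# wordcount_local [FILE...]: count whitespace-separated tokens from the
# given files (or stdin) and print "token<TAB>count", sorted by token.
wordcount_local() {
  awk '
    BEGIN { FS = "[ \t\r\f]+" }          # split on runs of whitespace
    {
      for (i = 1; i <= NF; i++)
        if ($i != "") count[$i]++        # skip the empty field before leading blanks
    }
    END { for (w in count) printf "%s\t%d\n", w, count[w] }
  ' "$@" | LC_ALL=C sort                 # byte-wise order, roughly how the job sorts its keys
}
```

For example, `wordcount_local $HOME/samples/gutenberg/*.txt | tail` should show entries comparable to the end of part-r-00000.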

  • Check the output
[hadoop@localhost hadoop]$ bin/hadoop fs -ls /user/hadoop/gutenberg-output
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2015-07-29 11:42 /user/hadoop/gutenberg-output/_SUCCESS
-rw-r--r--   1 hadoop supergroup     883509 2015-07-29 11:42 /user/hadoop/gutenberg-output/part-r-00000

  • Copy the output back to the local filesystem
[hadoop@localhost hadoop]$ bin/hadoop fs -get /user/hadoop/gutenberg-output $HOME/samples/gutenberg-output
15/07/29 11:49:36 WARN hdfs.DFSClient: DFSInputStream has been closed already
15/07/29 11:49:36 WARN hdfs.DFSClient: DFSInputStream has been closed already
[hadoop@localhost hadoop]$ 

  • View the retrieved output (only the last lines are shown here)
[hadoop@localhost hadoop]$ cat $HOME/samples/gutenberg-output/*
youth--devoted 1
youth. 4
youth." 1
youth.] 1
youth; 1
youth? 2
youth_ 1
youthful 10
youthfulness, 1
youths 1
youve 2
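
The output above is ordered by word, so the most frequent words are hard to spot. One way to rank it by count is sketched below, assuming each line of part-r-00000 is a word and a count separated by a tab (the default separator for text output). `top_words` is a hypothetical helper name.

```shell
# top_words N [FILE...]: print the N most frequent entries from
# "word<TAB>count" lines read from the given files or stdin.
top_words() {
  n="$1"
  shift
  tab=$(printf '\t')
  # Sort by the count column, numerically and descending; break ties by word.
  LC_ALL=C sort -t "$tab" -k2,2nr -k1,1 "$@" | head -n "$n"
}
```

For example, `top_words 10 $HOME/samples/gutenberg-output/part-r-00000` would list the ten most frequent tokens.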


I have finally run the MapReduce example from start to finish. This is only the most basic case; there is still plenty left to study ...



