Background
Continuing from the previous post, Hadoop: 如何在 CentOS 7.1.1503 安裝 Hadoop 2.7.1 (Single-Node Cluster), this article documents running a MapReduce example. The examples below come from the first three articles in the reference list; this post simply records my own steps as a reference for later installations.
This article is divided into four parts:
1. Downloading the MapReduce sample data
2. Uploading the sample data to Hadoop
3. Verifying the sample data was uploaded to Hadoop
4. Running WordCount as a data-analysis example
1. Downloading the MapReduce sample data
The authors of the first three reference articles provide three URLs for downloading sample data (choose the Plain Text UTF-8 files). Download them into the samples/gutenberg subdirectory:
* The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
* The Notebooks of Leonardo Da Vinci
* Ulysses by James Joyce
[hadoop@localhost ~]$ mkdir samples/gutenberg
[hadoop@localhost ~]$ cd samples/gutenberg
[hadoop@localhost samples]$ wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
--2015-07-29 10:47:06--  http://www.gutenberg.org/cache/epub/20417/pg20417.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 674570 (659K) [text/plain]
Saving to: ‘pg20417.txt’
100%[======================================>] 674,570 31.2KB/s in 28s
2015-07-29 10:47:35 (23.2 KB/s) - ‘pg20417.txt’ saved [674570/674570]
[hadoop@localhost samples]$ wget http://www.gutenberg.org/files/5000/5000-8.txt
--2015-07-29 10:48:38--  http://www.gutenberg.org/files/5000/5000-8.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1428841 (1.4M) [text/plain]
Saving to: ‘5000-8.txt’
100%[======================================>] 1,428,841 55.1KB/s in 15s
2015-07-29 10:48:53 (95.2 KB/s) - ‘5000-8.txt’ saved [1428841/1428841]
[hadoop@localhost samples]$ wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
--2015-07-29 10:49:40--  http://www.gutenberg.org/cache/epub/4300/pg4300.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1573151 (1.5M) [text/plain]
Saving to: ‘pg4300.txt’
100%[======================================>] 1,573,151 46.3KB/s in 38s
2015-07-29 10:50:18 (40.5 KB/s) - ‘pg4300.txt’ saved [1573151/1573151]
[hadoop@localhost samples]$
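A quick way to confirm the downloads completed is to compare byte counts against the sizes wget reported. Since re-downloading needs a network connection, this sketch demonstrates the check on a stand-in file; the commented command shows the real invocation:

```shell
# Demonstrate the size check on a stand-in file (10 bytes):
printf '0123456789' > /tmp/pg-demo.txt
wc -c < /tmp/pg-demo.txt
# For the real downloads, run inside samples/gutenberg:
#   wc -c pg20417.txt 5000-8.txt pg4300.txt
# wget reported 674570, 1428841 and 1573151 bytes respectively.
```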
2. Uploading the sample data to Hadoop
- Create the HDFS directories needed to run MapReduce jobs
[hadoop@localhost hadoop]$ echo $HOME
/home/hadoop
[hadoop@localhost hadoop]$ cd $HADOOP_HOME
[hadoop@localhost hadoop]$ pwd
/usr/local/hadoop
[hadoop@localhost hadoop]$ bin/hdfs dfs -mkdir /user
(Note: this creates a directory inside HDFS itself, not on the Linux filesystem)
[hadoop@localhost hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@localhost hadoop]$ bin/hdfs dfs -mkdir /user/hadoop/gutenberg
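The three separate -mkdir calls can likely be collapsed into one: the Hadoop 2.x FS shell's -mkdir accepts -p, which creates parent directories as needed. Its semantics match the Unix mkdir -p demonstrated locally below (the HDFS command itself needs a running cluster, so it is only shown as a comment):

```shell
# HDFS equivalent (needs a running cluster, shown for reference):
#   bin/hdfs dfs -mkdir -p /user/hadoop/gutenberg
# The -p flag's semantics match Unix mkdir -p, creating parents as needed:
mkdir -p /tmp/hdfs-demo/user/hadoop/gutenberg
ls /tmp/hdfs-demo/user/hadoop
# prints: gutenberg
```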
- Copy the files into the HDFS directory
[hadoop@localhost hadoop]$ bin/hadoop fs -put $HOME/samples/gutenberg/*.txt /user/hadoop/gutenberg
[hadoop@localhost hadoop]$
3. Verifying the sample data was uploaded to Hadoop
- List the HDFS directory to confirm the upload
[hadoop@localhost /]$ cd $HADOOP_HOME
[hadoop@localhost hadoop]$ bin/hadoop fs -ls /user/hadoop
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2015-07-29 11:25 /user/hadoop/gutenberg
[hadoop@localhost hadoop]$ bin/hadoop fs -ls /user/hadoop/gutenberg
Found 3 items
-rw-r--r-- 1 hadoop supergroup 1428841 2015-07-29 11:25 /user/hadoop/gutenberg/5000-8.txt
-rw-r--r-- 1 hadoop supergroup 674570 2015-07-29 11:25 /user/hadoop/gutenberg/pg20417.txt
-rw-r--r-- 1 hadoop supergroup 1573151 2015-07-29 11:25 /user/hadoop/gutenberg/pg4300.txt
[hadoop@localhost hadoop]$
4. Running WordCount as a data-analysis example
- Run the MapReduce program
[hadoop@localhost hadoop]$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-output
Not a valid JAR: /usr/local/hadoop/hadoop*examples*.jar

The wildcard matches nothing because in Hadoop 2.x the examples jar no longer sits directly under $HADOOP_HOME; find shows where it actually lives:

[hadoop@localhost hadoop]$ find . -name \*examples*.jar -print
./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar
./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.1-sources.jar
./share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.1-test-sources.jar
[hadoop@localhost hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop*examples*.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-output
15/07/29 11:41:33 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/07/29 11:41:33 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/07/29 11:41:35 INFO input.FileInputFormat: Total input paths to process : 3
15/07/29 11:41:35 INFO mapreduce.JobSubmitter: number of splits:3
15/07/29 11:41:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local170720327_0001
15/07/29 11:41:37 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/07/29 11:41:37 INFO mapreduce.Job: Running job: job_local170720327_0001
15/07/29 11:41:37 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/07/29 11:41:37 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/07/29 11:41:37 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/07/29 11:41:38 INFO mapreduce.Job: Job job_local170720327_0001 running in uber mode : false
15/07/29 11:41:38 INFO mapreduce.Job: map 0% reduce 0%
15/07/29 11:41:38 INFO mapred.LocalJobRunner: Waiting for map tasks
15/07/29 11:41:38 INFO mapred.LocalJobRunner: Starting task: attempt_local170720327_0001_m
...
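Conceptually, what the WordCount job does can be sketched with standard Unix tools: split the input into tokens (the map phase), then count occurrences of each token (the reduce phase). This local pipeline is only an illustration and does not use Hadoop's actual tokenizer:

```shell
# A local sketch of WordCount's map/reduce phases, for illustration only.
# "map": one token per line; "reduce": count occurrences of each token.
printf 'to be or not to be\n' \
  | tr -s '[:space:]' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}' \
  | sort
# prints (tab-separated): be 2 / not 1 / or 1 / to 2
```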
- Inspect the output

[hadoop@localhost hadoop]$ bin/hadoop fs -ls /user/hadoop/gutenberg-output
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2015-07-29 11:42 /user/hadoop/gutenberg-output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 883509 2015-07-29 11:42 /user/hadoop/gutenberg-output/part-r-00000
- Fetch the output back to the local filesystem

[hadoop@localhost hadoop]$ bin/hadoop fs -get /user/hadoop/gutenberg-output $HOME/samples/gutenberg-output
15/07/29 11:49:36 WARN hdfs.DFSClient: DFSInputStream has been closed already
15/07/29 11:49:36 WARN hdfs.DFSClient: DFSInputStream has been closed already
[hadoop@localhost hadoop]$
- View the fetched output

[hadoop@localhost hadoop]$ cat $HOME/samples/gutenberg-output/*
...
youth--devoted 1
youth. 4
youth." 1
youth.] 1
youth; 1
youth? 2
youth_ 1
youthful 10
youthfulness, 1
youths 1
youve 2
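The part-r-00000 file is tab-separated word/count pairs sorted by word. To see the most frequent words instead, it can be re-sorted on the count column. A sketch on a small synthetic file in the same format (the filename /tmp/sample-counts.txt is just for this demo):

```shell
# Build a tiny file in WordCount's output format (word<TAB>count) ...
printf 'youthful\t10\nyouths\t1\nyouve\t2\n' > /tmp/sample-counts.txt
# ... then sort by the second (count) field, numerically, descending:
sort -k2,2nr /tmp/sample-counts.txt | head -n 2
# prints:
# youthful	10
# youve	2
```

On the real output the same command would be `sort -k2,2nr part-r-00000 | head`.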
Summary

Finally, the MapReduce example has been run end to end. This is only the most basic case; there is still plenty more to study...

References
- [hadoop]Hadoop 小象安裝測試在 Ubuntu
- [hadoop]Hadoop 小象安裝測試在 Ubuntu (2)
- Running Hadoop on Ubuntu Linux (Single-Node Cluster)
- [研究] Hadoop 2.6.0 Single Cluster 安裝 (CentOS 7.0 x86_64)
- Hadoop File System Shell Command Reference