Hadoop Ecosystem: Python + MapReduce + WordCount



Start the Hadoop processes

Upload the input files to HDFS

hdfs dfs -put /home/hadoop/hadoop/input /user/hadoop/input

List the files that are now in HDFS

[hadoop@master0 hadoop]$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2019-12-04 02:17 /user


Check that the files were uploaded correctly


Run the program to count how many times each word occurs

View the output

[hadoop@master0 hdfs]$ hdfs dfs -cat /user/hadoop/output/*
work,	63
worker	315
would	62
write-operations.	62
written	62
........
.......
.......
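The job writes one word and its count per line, ordered by word rather than by count. A minimal sketch of ranking that output locally, assuming the lines have already been read into a list; `top_words` is a hypothetical helper name, not part of Hadoop:

```python
def top_words(lines, n=3):
    """Return the n most frequent (word, count) pairs from reducer output lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace so a tab or a space both work.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        word, count = parts
        try:
            pairs.append((word, int(count)))
        except ValueError:
            continue  # skip lines whose count is not an integer
    # Sort by count descending, then by word for a predictable order.
    pairs.sort(key=lambda p: (-p[1], p[0]))
    return pairs[:n]


sample = ["work,\t63", "worker\t315", "would\t62", "written\t62"]
print(top_words(sample, n=2))
```

With the four sample lines taken from the output above, the two highest counts belong to `worker` and `work,`.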

Write the MapReduce program and run it through Hadoop Streaming

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Software: PyCharm
# @virtualenv: workon
# @contact: 1040691703@qq.com
# @Desc: Code description
__author__ = '未昔/AngelFate'
__date__ = '2019/12/4 20:20'

import sys

# Mapper: emit "word<TAB>1" for every whitespace-separated word on stdin.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
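Because `line.split()` splits on whitespace only, punctuation stays attached to each token, so forms such as `HDFS`, `HDFS,` and `HDFS.` end up as separate keys. A minimal sketch of a mapper variant that normalizes tokens first; the function name `tokenize` and the choice to lowercase and strip surrounding punctuation are assumptions made for illustration, not part of the original script:

```python
import string


def tokenize(line):
    """Yield lowercase words with leading and trailing punctuation removed."""
    for token in line.split():
        word = token.strip(string.punctuation).lower()
        if word:  # drop tokens that consisted of punctuation only
            yield word


# In a streaming mapper, iterate over sys.stdin instead of this sample list.
sample = ["HDFS, HDFS. (HDFS) hdfs"]
for line in sample:
    for word in tokenize(line):
        print("%s\t%s" % (word, 1))
```

All four tokens in the sample line map to the same key, so the reducer would add them into a single count. Whether that merging is wanted depends on the analysis; the original mapper keeps the tokens exactly as they appear.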

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Software: PyCharm
# @virtualenv: workon
# @contact: 1040691703@qq.com
# @Desc: Code description
__author__ = '未昔/AngelFate'
__date__ = '2019/12/4 20:25'

import sys

# Reducer: sum the counts for each word. Input arrives sorted by word.
current_word = None  # previous word, used for comparison
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip lines without a tab or with a non-integer count.
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the accumulated total for the last word.
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))
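The reducer relies on Hadoop Streaming delivering the mapper output sorted by key, so all lines for one word arrive consecutively. Under that same assumption, the running-total logic can be sketched more compactly with `itertools.groupby`; `parse` and `reduce_counts` are hypothetical names used only for this illustration:

```python
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) pairs from 'word<TAB>count' lines, skipping bad ones."""
    for line in lines:
        try:
            word, count = line.rstrip("\n").split("\t", 1)
            yield word, int(count)
        except ValueError:
            continue  # no tab on the line, or the count is not an integer


def reduce_counts(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# In a streaming reducer, pass sys.stdin instead of this sample list.
sample = ["a\t1\n", "a\t1\n", "hadoop\t1\n", "hdfs\t1\n", "hdfs\t1\n"]
for word, total in reduce_counts(sample):
    print("%s\t%s" % (word, total))
```

`groupby` only merges adjacent equal keys, which is why the sorted-input assumption matters: on unsorted input the same word would be emitted more than once.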

[hadoop@master0 hadoop]$ bin/hadoop jar \
    share/hadoop/tools/lib/****.jar \
    -file mapper.py -mapper "python mapper.py" \
    -file reducer.py -reducer "python reducer.py" \
    -input /user/hadoop/input -output /user/hadoop/output
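Before submitting the job, the same data flow can be checked without a cluster. Hadoop Streaming feeds each input line to the mapper, sorts the mapper output by key, and then feeds it to the reducer. A minimal single-process sketch of that flow; `map_phase`, `reduce_phase` and `run_local` are hypothetical helpers that only imitate the three phases and do not call Hadoop:

```python
def map_phase(lines):
    """Emit (word, 1) for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_phase(sorted_pairs):
    """Sum consecutive counts for the same word, like reducer.py."""
    current_word, current_count = None, 0
    for word, count in sorted_pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count  # flush the last word


def run_local(lines):
    """Imitate map, sort by key, and reduce in a single process."""
    return list(reduce_phase(sorted(map_phase(lines))))


text = ["hadoop hdfs hadoop", "hdfs mapreduce hadoop"]
for word, total in run_local(text):
    print("%s\t%s" % (word, total))
```

The `sorted` call stands in for the shuffle-and-sort step; without it, the reducer logic would emit a word once per run of adjacent occurrences instead of once overall.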

[hadoop@master0 hadoop]$ hadoop fs -cat output/part-00000

(API),	62
(C++,	62
(FUSE)	62
(HDFS)	62
(HDFS).	63
(JAR)	63
(JRE)	63
(RPC)	62
(SPoF),	62
(a,b,c)	62
(a,b,c),	62
(but	62
(either	63
(more	63
(ssh)	63
(the	62
(typically	62
(webapp)	62
(x,y,z)	62
(x,y,z).	62
1.6	63
2.0	62
2012,[66]	62
3,	62
3rd-party	62
A	369
API	124
ARchive	63
An	62
Append.	62
B	124
Because	62
C#,	62
Clients	62
Cocoa,	62
Common	126
Data	62
DataNode	126
DataNode.	63
Distributed	63
Each	62
Environment	63
Erlang,	62
Failure	62
Federation,	62
File	182
Filesystem	62
For	119
HDFS	732
HDFS,	62
HDFS-UI	62
HDFS.	62
HTTP,	62
Hadoop	986
Hadoop-compatible	63
Hadoop.	63
Haskell,	62
I/O	62
In	121
It	62
Java	312
Java,	62
Job	63
JobTracker	63
Linux	62
MapReduce	126
MapReduce/MR1	63
May	62
Moreover,	62
NameNode	126
NameNode)	62
NameNode,	189
OCaml),	62
OS	63
PHP,	62
POSIX	124
POSIX-compliant	62
POSIX-compliant,	62
Perl,	62
Point	62
Python,	62
RAID	124
Ruby,	62
Runtime	63
Secure	63
Shell	63
Similarly,	63
Single	62
Smalltalk,	62
Some	62
System	63
TCP/IP	62
Task	63
TaskTracker,	63
The	552
These	125
This	187
Thrift	62
Tracker,	126
Unix	62
Userspace	62
Web	62
When	125
With	62
YARN/MR2)[58]	63
a	1994
abstractions,	63
access	62
achieved	62
achieves	62
across	250
actions,	62
acts	63
added	62
addition,	62
advantage	124
aims	62
allowing	62
also	62
alternate	63
although	62
always	62
amount	62
an	187
and	1747
and,	63
announced	62
application	124
application.	62
applications	63
applications.	63
approach	63
architecture	63
are	437
around,	62
as	249
automatic	62
available	62
available.	125
awareness	124
awareness:	63
backbone	63
backup	62
backup.	62
be	497
because	62
become	62
been	62
between	187
block	62
blocks	62
both	63
bottleneck	124
browsed	62
builds	62
but	62
by	249
call	62
can	561
capabilities,	62
certain	62
checkpointed	62
choosing	62
client	124
cluster	376
cluster,	63
code	63
command-line	62
commands	62
communicate	62
communication.	62
compliance	62
compute-only	63
concurrent	62
configurations	62
connects	62
consider	62
consists	126
contains	187
copies	62
corruption	63
create	62
criticality.	62
data	934
data,	62
data-intensive	62
data-only	63
data.	63
datanode	62
datanodes,	62
dedicated	63
default	62
demonstrated	62
designed	62
developing	62
differ	62
different	62
directly	62
directories.	62
directory	124
distributed	124
distributed,	62
does	124
due	124
each	124
edit	62
effective	63
engine	63
entire	62
equivalents.	63
especially	62
every	63
example:	62
execute	63
extent	62
fact,	62
fail	62
fail-over.	62
failed	62
failing	63
failure;	63
failures	63
file	810
file-system	249
file-system-specific	63
files	125
files,	62
files.	62
files[65]	62
for	869
framework.	62
from	62
fully	124
generate	125
gigabytes	62
goals	62
goes	124
hardware	63
has	186
have	125
having	124
hence	62
high-availability	62
high.	62
higher.	63
host	63
hosts	62
hosts,	62
huge	124
if	125
images	62
immutable	62
impact	125
in	560
inability	62
includes	125
incorrectly	62
increase	62
increased	62
index,	63
information	63
information,	62
instead	62
interface	62
interface,	62
interpret	62
is	622
is,	63
is.	63
issue,	62
issues	62
it	187
its	124
job	249
job-completion	62
jobs	62
jobs.	62
journal	62
keep	62
lack	62
language	62
large	124
larger	63
letting	62
level	63
libraries.A	6
libraries.File	6
libraries.For	6
libraries.HDFS	15
libraries.Hadoop	16
libraries.In	4
libraries.The	9
local	62
location	63
location.	62
log	62
loss	63
machines.	62
main	62
manage	63
managed	63
management	62
manually	62
map	186
master	126
may	62
memory	63
metadata	124
metadata,	62
method	63
methods	62
might	62
misleading	62
mostly	62
mounted	62
mounted,[62]	62
move	62
multi-node	63
multiple	312
name	125
namely,	62
namenode	496
namenode's	125
namenode,	62
namenodes.	62
namespaces	62
native	62
necessary	63
needed	63
network	249
new	62
node	500
nodes	251
nodes.	189
nodes:	62
nominally	62
non-POSIX	62
nonstandard	63
normally	63
not	310
number	124
occurs,	63
of	1622
offline.	62
on	622
one	125
only	63
operations	62
options	62
or	562
other	248
other.	62
outage	63
over	248
package	63
package,	63
perform	124
performance	124
plus	62
point	62
portable	62
possible	63
power	63
precisely,	63
preventing	63
prevents	62
primary	248
problem	62
problem,	62
procedure	62
programming	62
project	62
protocol	62
provide	125
provides	63
rack	126
rack,	62
rack.	62
rack/switch	63
racks.	63
range	62
rebalance	62
reduce	249
reduces	125
redundancy	125
regularly	62
release	62
reliability	62
remain	63
remote	124
replaced	63
replay	62
replicating	125
replication	124
request.	62
require	125
requirements	62
requires	63
requiring	62
restart	62
running	62
same	125
saves	62
scalability	62
scalable,	62
scheduled	62
schedules	124
scheduling	126
scripts	126
secondary	250
separate	62
served	62
server	188
serves	62
set	63
shell	62
should	63
shutdown	63