Tips for troubleshooting the java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 error when running Hadoop streaming.
last tool output: |null|
java.io.IOException: Stream closed
at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)
at java.io.OutputStream.write(OutputStream.java:116)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-01-22 21:16:56,641 WARN [main] org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Stream closed
2016-01-22 21:16:56,641 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:120)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-01-22 21:16:56,641 WARN [main] org.apache.hadoop.streaming.PipeMapRed: java.io.IOException: Stream closed
2016-01-22 21:16:56,641 INFO [main] org.apache.hadoop.streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-01-22 21:16:56,643 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
The problem is that something is wrong with how the script itself is executed.
Here is the troubleshooting sequence I put together.
1. If the mapper and reducer are written in a scripting language, check that the shebang line (#!) names the correct interpreter.
2. Check that the mapper and reducer files have been made executable with chmod.
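The first two checks can be run locally before submitting any job; a minimal sketch (the file name and mapper contents are examples, not the original scripts):

```shell
# Hypothetical minimal mapper: the shebang must name a real interpreter
cat > mapper.py <<'EOF'
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    word = line.strip()
    if word:
        print(word + "\t1")
EOF

# The streaming task launches the script directly, so it must be executable
chmod +x mapper.py

# Confirm the shebang resolves and the script runs outside Hadoop
echo "hello" | ./mapper.py
```

If this local run already exits with a non-zero code, the streaming job will fail the same way with subprocess failed with code 1.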
3. Watch your directories.
Be careful with paths when you first use hadoop streaming: once a file is shipped with -file, the -mapper and -reducer options must refer to it by its bare file name, as if it sat in the task's local working directory.
Wrong (full local paths in -mapper and -reducer):
#!/bin/sh
...
cmd="hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.5.1.jar \
-D mapred.job.priority=LOW \
-D mapred.job.name=\"TEST\" \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input ${INPUT} \
-output ${OUTPUT} \
-mapper /home/google/mapper.py \
-reducer /home/google/reducer.py \
-file /home/google/mapper.py \
-file /home/google/reducer.py \
-numReduceTasks 1"
eval $cmd
Correct (bare file names matching the -file uploads):
#!/bin/sh
...
cmd="hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.5.1.jar \
-D mapred.job.priority=LOW \
-D mapred.job.name=\"TEST\" \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input ${INPUT} \
-output ${OUTPUT} \
-mapper mapper.py \
-reducer reducer.py \
-file /home/google/mapper.py \
-file /home/google/reducer.py \
-numReduceTasks 1"
eval $cmd
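Because streaming is just stdin/stdout piping, a subprocess failed with code 1 can usually be reproduced locally by simulating the pipe by hand. A sketch with a hypothetical word-count mapper/reducer pair (not the original scripts):

```shell
# Hypothetical mapper: emit "word<TAB>1" per word
cat > mapper.py <<'EOF'
#!/usr/bin/env python3
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
EOF

# Hypothetical reducer: sum counts for each key in sorted input
cat > reducer.py <<'EOF'
#!/usr/bin/env python3
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print(current + "\t" + str(count))
EOF

chmod +x mapper.py reducer.py

# Streaming is roughly equivalent to: input -> mapper -> sort -> reducer
printf 'a b a\n' | ./mapper.py | sort | ./reducer.py
```

Any traceback or non-zero exit here is the same failure the cluster would report, but with the real error message visible.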
4. When referencing files from a single directory only.
That is, this applies when you run hadoop streaming based on files in the local directory, as below.
All files referenced from the current directory:
#!/bin/sh
...
cmd="hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.5.1.jar \
-D mapred.job.priority=LOW \
-D mapred.job.name=\"TEST\" \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input ${INPUT} \
-output ${OUTPUT} \
-mapper mapper.py \
-reducer reducer.py \
-file ./mapper.py \
-file ./reducer.py \
-numReduceTasks 1"
eval $cmd
Files referenced from different directories:
#!/bin/sh
...
cmd="hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.5.1.jar \
-D mapred.job.priority=LOW \
-D mapred.job.name=\"TEST\" \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input ${INPUT} \
-output ${OUTPUT} \
-mapper mapper.py \
-reducer reducer.py \
-file ./mapper.py \
-file ../reduce.py \
-file ../util/log.py \
-numReduceTasks 1"
eval $cmd
At this point you could call sys.path.append() in mapper.py in order to import util/log.py, but be careful on recent versions.
Hadoop is strict about these details...
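For reference, files shipped with -file land flat in the task's working directory, so ../util/log.py arrives as ./log.py next to mapper.py. A minimal sketch of the sys.path approach (the log module and its info() helper are hypothetical, and the block simulates the flat working directory instead of running on a cluster):

```python
import os
import sys

# Simulate what -file does: every upload lands flat in the task's working
# directory, so ../util/log.py on the submit host becomes ./log.py here.
# The log module and its info() helper are made up for illustration.
with open("log.py", "w") as f:
    f.write("def info(msg):\n    return '[INFO] ' + msg\n")

# Make sure the working directory is importable before importing the helper
sys.path.insert(0, os.getcwd())
import log

print(log.info("mapper started"))  # -> [INFO] mapper started
```

The key point is that the import uses the bare module name (import log), not the submit-host path (util/log.py), because the original directory layout does not exist on the task node.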
5. If none of the cases above apply, check the task error log in the Hadoop web UI. In my case it reported a missing library, so install the required library on the data nodes.