Tuesday, June 17, 2014

Notes on Hadoop streaming Python

1. When we write a Hadoop Streaming Python script, we need to make sure that both the input and the output of the mapper and the reducer have a key-value structure. Otherwise, even if the Python script works locally, it won't work on Hadoop Streaming. As long as the input and output of the mapper and reducer keep the key-value structure, we can put any algorithm in between them, and provided everything else is correct, the program will work fine.
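To make the key-value structure concrete, here is a minimal word-count sketch. It is not the script from this post: the names map_lines, reduce_lines and run are made up for illustration, and it assumes the default Hadoop Streaming convention that the key and the value are separated by the first tab on each line, and that the reducer receives its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper step: emit one 'key<TAB>value' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer step: sum the values of each key.

    Assumes every line contains a tab and that lines arrive sorted by key.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run(step, source=sys.stdin, sink=sys.stdout):
    """Hypothetical entry point: a mapper script would call run(map_lines),
    a reducer script would call run(reduce_lines)."""
    for out in step(source):
        sink.write(out + "\n")
```

Whatever algorithm goes inside the two functions, the lines they emit keep the key, tab, value shape, which is the part Hadoop Streaming depends on.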

2. When the algorithm needs to read local files and the cluster has several nodes, the files must be available at the same path on every node, because the mapper and reducer run on whichever node YARN assigns and may not be able to locate them otherwise. We then pass the paths to the scripts with the -cmdenv option in the shell script, which sets an environment variable for the streaming tasks. Like this:

           -mapper "mapper_senti.py" \
           -reducer "intend5.py" \
           -file mapper_senti.py \
           -file intend5.py \
           -cmdenv dir1=/home/ubuntu/aclImdb/train/pos \
           -cmdenv dir2=/home/ubuntu/aclImdb/train/pos/ \
           -cmdenv dir3=/home/ubuntu/aclImdb/train/neg \
           -cmdenv dir4=/home/ubuntu/aclImdb/train/neg/ \
           -cmdenv dir5=/home/ubuntu/aclImdb/test/pos \
           -cmdenv dir6=/home/ubuntu/aclImdb/test/pos/ \
           -cmdenv dir7=/home/ubuntu/aclImdb/test/neg \
           -cmdenv dir8=/home/ubuntu/aclImdb/test/neg/ \


And in the corresponding Python script, we import the os package and read the variables:

  import os

  dir1=os.environ['dir1']
  dir2=os.environ['dir2']
  dir3=os.environ['dir3']
  dir4=os.environ['dir4']

Please make sure to use square brackets, as in os.environ['dir1'], and not other brackets such as parentheses.
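One thing to keep in mind is that os.environ['dir1'] raises a KeyError when the variable is not set, which is what happens if we run the script locally without exporting it first. Here is a small sketch of a helper that reads the variable with an optional fallback. The name get_dir and its parameters are hypothetical, not part of Hadoop or of the original scripts.

```python
import os


def get_dir(name, default=None, environ=os.environ):
    """Return the directory passed with -cmdenv name=<path>.

    Falls back to `default` when the variable is missing, and raises a
    KeyError with a hint when there is no fallback either.
    """
    value = environ.get(name, default)
    if value is None:
        raise KeyError(
            f"environment variable {name!r} is not set; "
            f"pass it with -cmdenv {name}=<path>"
        )
    return value
```

In a script, dir1 = get_dir('dir1') behaves like os.environ['dir1'] on the cluster, while get_dir('dir1', default='.') lets the same script run on a local machine for testing.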
