1. When we write a Hadoop Streaming script in Python, we need to make sure that both the input and the output of the mapper and the reducer have a key-value structure. Otherwise, even if the Python script works locally, it won't work on Hadoop Streaming. As long as the mapper and the reducer keep that key-value structure for their input and output, we can put any algorithm in between them, and as long as everything else is set up correctly, the program will work fine.
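To illustrate the key-value structure, here is a minimal sketch of a word-count style mapper and reducer. It assumes the default Hadoop Streaming convention that each line is `key<TAB>value`. The function names `map_lines` and `reduce_lines` are hypothetical and are not taken from the scripts used in this post. In a real job each function would read `sys.stdin` and print its results; here the sort between the two stages is simulated locally with `sorted`.

```python
from itertools import groupby


def map_lines(lines):
    """Emit one 'key<TAB>value' line per word (the value is always 1)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum the values of each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


# Local simulation: sorted() stands in for Hadoop's sort between the stages.
sample = ["good movie", "bad movie"]
shuffled = sorted(map_lines(sample))
result = list(reduce_lines(shuffled))
print(result)
```

Any other logic can replace the counting, as long as each stage still reads and writes lines in this key-value form.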
2. When the algorithm needs to read local files and the system has several nodes, the files have to be available on every node, since the mapper and the reducer run on whichever nodes YARN assigns and probably can't locate files that exist on only one machine. We can then pass the paths to the scripts as environment variables using the -cmdenv option in the shell script. Like this:
-mapper "mapper_senti.py" \
-reducer "intend5.py" \
-file mapper_senti.py \
-file intend5.py \
-cmdenv dir1=/home/ubuntu/aclImdb/train/pos \
-cmdenv dir2=/home/ubuntu/aclImdb/train/pos/ \
-cmdenv dir3=/home/ubuntu/aclImdb/train/neg \
-cmdenv dir4=/home/ubuntu/aclImdb/train/neg/ \
-cmdenv dir5=/home/ubuntu/aclImdb/test/pos \
-cmdenv dir6=/home/ubuntu/aclImdb/test/pos/ \
-cmdenv dir7=/home/ubuntu/aclImdb/test/neg \
-cmdenv dir8=/home/ubuntu/aclImdb/test/neg/ \
In the corresponding Python script, we import the os module and read those variables from os.environ:
import os
dir1=os.environ['dir1']
dir2=os.environ['dir2']
dir3=os.environ['dir3']
dir4=os.environ['dir4']
Please make sure to use square brackets in os.environ['dir1'], not other brackets such as parentheses. os.environ is indexed like a dictionary, so os.environ('dir1') raises a TypeError.
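If a variable was not passed with -cmdenv, os.environ['dir1'] fails with a KeyError that can be hard to spot in the task logs. Below is a small sketch of one way to check all the variables up front. The helper name `read_dirs` is hypothetical, and the `environ` parameter exists only so the function can be tried locally with an ordinary dictionary.

```python
import os


def read_dirs(names, environ=os.environ):
    """Return {name: path} for each variable, failing early if any is missing."""
    missing = [name for name in names if name not in environ]
    if missing:
        raise KeyError("missing -cmdenv variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Local try-out with a plain dict standing in for the task's environment.
fake_env = {"dir1": "/home/ubuntu/aclImdb/train/pos"}
dirs = read_dirs(["dir1"], fake_env)
print(dirs["dir1"])
```

Inside the actual mapper or reducer, the call would be read_dirs(['dir1', 'dir2', 'dir3', 'dir4']) with no second argument, so that it reads the real environment.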