
Tuesday, December 6, 2016

What Apache Spark is and how it works, part II

First, two Apache Spark tutorials:

http://www.cloudera.com/documentation/enterprise/5-6-x/PDF/cloudera-spark.pdf

https://www.tutorialspoint.com/apache_spark/apache_spark_tutorial.pdf

Spark is an open-source engine for processing big data.

Here is how to install Spark on Ubuntu:

http://blog.prabeeshk.com/blog/2014/10/31/install-apache-spark-on-ubuntu-14-dot-04/



Spark is considered the successor to MapReduce for general-purpose data processing on Apache Hadoop clusters. In MapReduce, the highest-level unit of computation is a job; in Spark, the highest-level unit of computation is an application.
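To make the distinction concrete, here is a minimal sketch in the Scala shell (where sc is the SparkContext the shell creates for you): each action triggers a separate job, and both jobs run inside the same application.

scala> val nums = sc.parallelize(1 to 100)    // build an RDD
scala> val evens = nums.filter(_ % 2 == 0)    // transformation: no job runs yet
scala> evens.count()                          // action: job 1 of this application
scala> evens.take(5)                          // action: job 2 of the same application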

It exposes APIs for Java, Python, and Scala, and consists of Spark Core plus several related projects:

Spark SQL - A module for working with structured data that lets you seamlessly mix SQL queries with Spark programs (see the sketch after this list).

Spark Streaming - An API for building scalable, fault-tolerant streaming applications.

MLlib - An API that implements common machine learning algorithms.

GraphX - An API for graphs and graph-parallel computation.
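As a taste of the Spark SQL module mentioned above, here is a minimal sketch, assuming Spark 2.x, where the shell provides a SparkSession named spark; the file people.json is a hypothetical input:

scala> val df = spark.read.json("hdfs://namenode_host:8020/path/to/people.json")  // load structured data
scala> df.createOrReplaceTempView("people")                                       // register it as a SQL table
scala> val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")     // plain SQL query
scala> adults.show()                                                              // result comes back as a DataFrame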

Here is an example of how an Apache Spark application works.

The simplest way to run a Spark application is by using the Scala or Python shells.

1. To start one of the shell applications, run one of the following commands:

-Scala
$SPARK_HOME/bin/spark-shell

-Python
$SPARK_HOME/bin/pyspark

2. To run the classic Hadoop word count application, copy an input file to HDFS:

$ hdfs dfs -put input /path/to/input

3. Within a shell, run the word count application using the following code examples, substituting your values for namenode_host, path/to/input, and path/to/output:

Scala:

scala> val myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")

Python:

>>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")


The above code runs on core Spark from the interactive shells. We can also package the same logic as a standalone Spark application, as sketched below.
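Here is a minimal standalone Scala version of the same word count, a sketch assuming the spark-core dependency is on the classpath; the class name WordCount and the jar path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    sc.textFile(args(0))                  // input path, e.g. hdfs://namenode_host:8020/path/to/input
      .flatMap(line => line.split(" "))   // split each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .reduceByKey(_ + _)                 // sum the counts per word
      .saveAsTextFile(args(1))            // output path
    sc.stop()
  }
}

Package it as a jar and launch it on the cluster with spark-submit:

$SPARK_HOME/bin/spark-submit --class WordCount path/to/wordcount.jar hdfs://namenode_host:8020/path/to/input hdfs://namenode_host:8020/path/to/output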

Here are some Spark books:

Mastering Apache Spark by Mike Frampton

https://www.amazon.com/Mastering-Apache-Spark-Mike-Frampton-ebook/dp/B0119R8J00

Spark Cookbook by Rishi Yadav

https://www.amazon.com/Rishi-Yadav/e/B012UW5VZE/ref=pd_sim_351_bl_3?_encoding=UTF8&refRID=TNS2TCMGB0KY4NNM9MJ9












