What Is Apache Spark, and How Does It Work? Part II

Apache Spark is an open-source engine for processing big data.

Spark is considered the successor to MapReduce for general-purpose data processing on Apache Hadoop clusters. In MapReduce, the highest-level unit of computation is a job; in Spark, it is an application.

Spark exposes APIs for Java, Python, and Scala, and consists of Spark Core and several related projects:

Spark SQL: a module for working with structured data. It lets you seamlessly mix SQL queries with Spark programs.

Spark Streaming: an API for building scalable, fault-tolerant streaming applications.

MLlib: an API that implements common machine learning algorithms.

GraphX: an API for graphs and graph-parallel computation.

Here is an example of how an Apache Spark application works.

The simplest way to run a Spark application is by using the Scala or Python shells.

1. To start one of the shell applications, run one of the following commands (this assumes the Spark bin directory is on your PATH):

Scala: $ spark-shell
Python: $ pyspark

2. To run the classic Hadoop word count application, copy an input file to HDFS:

$ hdfs dfs -put input

3. Within a shell, run the word count application using the following code examples, substituting your own values for namenode_host, path/to/input, and path/to/output:


scala> val myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
scala> val counts = myfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")


>>> myfile = sc.textFile("hdfs://namenode_host:8020/path/to/input")
>>> counts = myfile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
>>> counts.saveAsTextFile("hdfs://namenode_host:8020/path/to/output")
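
To see what each stage of the word count produces, it can help to mimic the three steps (flatMap, map, reduceByKey) in plain Python on a tiny in-memory input. This is only a sketch of the data flow, not how Spark executes it: nothing here is distributed, the sample lines are made up, and reduce_by_key is a hypothetical helper standing in for Spark's reduceByKey.

```python
from itertools import chain

# Made-up sample input standing in for the lines of the HDFS file.
lines = ["to be or not to be", "to do is to be"]

# flatMap: split every line into words and flatten into one sequence.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with the count 1.
pairs = [(word, 1) for word in words]


def reduce_by_key(pairs, func):
    """Hypothetical helper: merge the values of equal keys with func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# reduceByKey: add up the 1s for each distinct word.
counts = reduce_by_key(pairs, lambda v1, v2: v1 + v2)
print(counts)
```

In Spark the same three steps run in parallel across the partitions of the input, and reduceByKey combines the partial counts for each word across partitions.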

The code above uses only core Spark. Besides working interactively in a shell, we can also build standalone Spark applications.
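
As a rough illustration of a standalone application, the same word count could be written as a Python script. This is a minimal sketch under assumptions: the file name wordcount.py, the function names, and the application name "WordCount" are made up for the example, and the input and output paths are taken from the command line instead of being hard-coded.

```python
import sys
from operator import add


def split_words(line):
    """Split one line of text into words on single spaces."""
    return line.split(" ")


def run_word_count(input_path, output_path):
    """Count the words in input_path and write the result to output_path."""
    # Imported here so that split_words can be used without Spark installed.
    from pyspark import SparkContext

    # In a shell, sc is created for you; a standalone script creates its own.
    sc = SparkContext(appName="WordCount")  # hypothetical application name
    counts = (
        sc.textFile(input_path)
        .flatMap(split_words)
        .map(lambda word: (word, 1))
        .reduceByKey(add)
    )
    counts.saveAsTextFile(output_path)
    sc.stop()


if __name__ == "__main__":
    if len(sys.argv) == 3:
        run_word_count(sys.argv[1], sys.argv[2])
    else:
        print("usage: wordcount.py <input_path> <output_path>", file=sys.stderr)
```

A script like this would typically be launched with spark-submit, for example: spark-submit wordcount.py hdfs://namenode_host:8020/path/to/input hdfs://namenode_host:8020/path/to/output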

