The amount of data being generated in today’s world is mind-boggling, and we need dedicated platforms and frameworks to handle it. This field of study is called Big Data Analysis. With so much data lying around, often ranging in petabytes and exabytes, we need powerful systems to process it, and we need to do it with high efficiency. If you try to do it with conventional tools, you will never finish in time, let alone do it in real time. This is where Apache Spark comes into the picture. It is an open source big data processing framework that can process massive amounts of data at high speed using cluster computing. Let’s see how we can install it on Ubuntu.
Prerequisites
The first step is to update the packages:
$ sudo apt-get update
We need to install a JRE and JDK. The following command will install the default OpenJDK runtime and development kit:
$ sudo apt-get install -y default-jre default-jdk
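Once that finishes, it’s worth sanity-checking the install. The following commands should print the installed versions (the exact output depends on which OpenJDK your Ubuntu release ships):

$ java -version
$ javac -version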
You need to install git (you’ll need it during the build process):
$ sudo apt-get install git
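As a quick check that git is available on your PATH, you can run:

$ git --version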
We are ready to proceed with the installation.
Download Spark
Go to this site and choose the following options:
- Choose a Spark release: pick the latest
- Choose a package type: Source code [can build several Hadoop versions]
- Choose a download type: Select Apache mirror
Below these options you will see “Download Spark” with a link next to it. Note that this is NOT the final download link. Clicking it takes you to a mirror page, and the download link at the top of that page is the one we need to use:
$ wget http://www.us.apache.org/dist/spark/spark-1.5.1/spark-1.5.1.tgz
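Once the download finishes, extract the tarball (the directory name below assumes the 1.5.1 source release we just downloaded):

$ tar xvf spark-1.5.1.tgz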
Install Scala
Spark is written in Scala, so we need to install Scala to build Spark. Download the latest stable version of Scala from here. Don’t download any versions with “-M1”, “-M2”, etc. (those are milestone releases). Run the following commands to download it and place it in the right directory:
$ wget http://www.scala-lang.org/files/archive/scala-2.10.6.tgz
$ sudo mkdir /usr/local/src/scala
$ sudo tar xvf scala-2.10.6.tgz -C /usr/local/src/scala/
Go to the end of your “~/.bashrc” file and add the following lines:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.6
export PATH=$SCALA_HOME/bin:$PATH
Reload the “~/.bashrc” file:
$ . ~/.bashrc
Check if Scala is installed successfully by running the following command:
$ scala -version
You should see the following on your terminal:
Scala code runner version 2.10.6 -- Copyright 2002-2013, LAMP/EPFL
Build Spark
We are ready to build Spark now. Note that it will take a while to build, so you need to be patient:
$ cd /path/to/spark-1.5.1
$ sbt/sbt assembly
Once it’s done, run the following command to check if everything is good:
$ ./bin/run-example SparkPi 10
A lot of log output will be printed on the terminal. Somewhere in there, you should see “Pi is roughly 3.141108”. Spark will print all these log messages every time we run something. To quiet that down, go into the “spark-1.5.1” directory and run the following command on the terminal:
$ cp conf/log4j.properties.template conf/log4j.properties
Open the newly created “conf/log4j.properties” file and replace the following line:
log4j.rootCategory=INFO, console
by
log4j.rootCategory=ERROR, console
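Alternatively, if you prefer not to open an editor, a one-line sed command along these lines should perform the same replacement (it assumes the default template line shown above):

$ sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' conf/log4j.properties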
Save the file and exit. Now run the following on your terminal:
$ ./bin/run-example SparkPi 10
You will see only “Pi is roughly 3.141108” printed on the terminal. We are now ready to roll! You can start the Python shell by running the following command:
$ ./bin/pyspark
You can run all the Python commands from this shell to make Spark do all the magic!
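As a quick first session, here is a tiny example you can try. The pyspark shell creates a SparkContext for you named sc, so something along these lines should work (the numbers are just placeholders for illustration):

>>> data = sc.parallelize(range(1, 1001))        # distribute a list of numbers across the cluster
>>> data.filter(lambda x: x % 2 == 0).count()    # count the even numbers
500
>>> data.map(lambda x: x * x).take(5)            # square each number and peek at the first five
[1, 4, 9, 16, 25]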