In the previous blog post, we saw how to start a Spark cluster on EC2 using the built-in launch scripts. This is good if you want to get something up and running quickly, but it doesn’t allow fine-grained control over the cluster. Often, we want to customize the machines we spin up. Let’s say you want to use different types of machines to handle production-level traffic in different regions. Maybe you are not on EC2 at all and still want to launch some machines in your cluster. How would you do it? This is where Spark Standalone mode comes in. With this method, we can manually launch any number of machines in our private cluster and make them listen to a master machine. It gives us a lot of flexibility! Let’s go ahead and see how to do it, shall we?
Prerequisites
We will be using EC2 to launch our cluster. Let’s go ahead and start two instances on EC2; we will use one instance as the master and the other as a slave. Let’s quickly update the packages on both machines:
$ sudo apt-get update
It’s a good idea to install the default JRE and JDK (OpenJDK) as well:
$ sudo apt-get install -y default-jre default-jdk
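You can verify that Java was installed correctly by checking the version:

$ java -version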
Download Spark
In the last blog post, we discussed how to build Spark from the ground up. If you just want to run an application, you don’t need to build it at all; you can just download a pre-built version of Spark and run your application on it. Go to this site and choose the following options:
- Choose a Spark release: pick the latest
- Choose a package type: Pre-built for Hadoop 2.6 and later
- Choose a download type: Select Apache mirror
- Download Spark: Note that this is NOT the final download link, so don’t wget it just yet. Click on it and it will take you to the mirror page, which has a suggested download link at the top. Copy that link.
Download it on both machines using the following command:
$ wget http://mirror.cogentco.com/pub/apache/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
Uncompress it using the following command:
$ tar -xvf spark-1.5.1-bin-hadoop2.6.tgz
You should see a folder named “spark-1.5.1-bin-hadoop2.6”.
Check if Spark is working
On each machine, go into the “spark-1.5.1-bin-hadoop2.6” folder and run the following command:
$ ./bin/run-example SparkPi 14
A lot of output will be printed on the terminal. Somewhere in there, you should see “Pi is roughly 3.141108”. This means that Spark is working on your machine. If you don’t want all those log messages printed on the terminal, you can simply disable them. You can check my previous blog post to see how to do it.
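For quick reference, one common way to quiet the logs (a minimal sketch, assuming the stock log4j template that ships with Spark) is:

$ cp conf/log4j.properties.template conf/log4j.properties

Then open conf/log4j.properties and change the line “log4j.rootCategory=INFO, console” to “log4j.rootCategory=WARN, console” so that only warnings and errors are printed.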
Starting the machines in master/slave mode
Let’s go into the master machine and start it in “master” mode. We just need to tell this machine that it will be the master controlling our cluster; when we run an application, the master distributes the workload across all the slave machines. On the master machine, go into the “spark-1.5.1-bin-hadoop2.6” folder and run the following command:
$ ./sbin/start-master.sh
Go to http://MASTER-IP:8080 in your browser. At the top, you will see something like this:
URL: spark://ip-172-34-12-158:7077
This is your Spark master URL; you will need it later to run your applications. This page is also the web UI of your master machine, where you can monitor your applications as they run.
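As a side note, if the default ports 7077 or 8080 clash with something else on your machine, the standalone scripts read a few environment variables from conf/spark-env.sh. A minimal sketch (the values shown here are just the defaults):

$ cp conf/spark-env.sh.template conf/spark-env.sh

Then add lines like these to conf/spark-env.sh before starting the master:

SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080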
Now let’s start the slave machine in “worker” mode. We just need to make this machine listen to our master. On the slave machine, go into the “spark-1.5.1-bin-hadoop2.6” folder and run the following:
$ ./sbin/start-slave.sh spark://ip-172-34-12-158:7077
As you can see, “spark://ip-172-34-12-158:7077” is the Spark master URL. You will see a message like “starting org.apache.spark.deploy.worker.Worker ….” printed on the terminal. Go to http://SLAVE-IP:8081 and you will see something like this at the top:
ID: worker-20151123231202-172.32.4.115-50700
Master URL: spark://ip-172-34-12-158:7077
This means that the slave machine is now listening to the master and the cluster is ready to go!
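If you add more slave machines later, you can repeat the start-slave.sh step on each one. Spark also ships with a helper script that starts every worker from the master in one shot; here is a sketch, assuming passwordless ssh from the master to each slave and Spark extracted at the same path on every machine. On the master machine, list the slave IPs in conf/slaves, one per line:

172.32.4.115

Then, from the master machine, run:

$ ./sbin/start-slaves.sh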
Running an application on the cluster
Now that the cluster is all set up, let’s give it a spin. We are going to run an application and see if it runs on both machines. From the “spark-1.5.1-bin-hadoop2.6” folder, run the following command:
$ ./bin/spark-submit --master spark://ip-172-34-12-158:7077 examples/src/main/python/pi.py 14
Go to http://MASTER-IP:8080 and you will see the application listed under “Running Applications”. Once it finishes running and you see “Pi is roughly 3.142842” printed on the terminal, it will move to “Completed Applications”.
Run the above command again and go to http://SLAVE-IP:8081. You will see an item listed under “Running Executors”. Once it finishes running, it will move to “Finished Executors”, and you will now see two items listed under that section. You are now successfully running a Spark cluster!
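By default, a standalone application grabs all the cores available on the cluster. If you want to cap the resources a job uses, spark-submit takes flags for that; for example (the values here are just placeholders, tune them to your instance types):

$ ./bin/spark-submit --master spark://ip-172-34-12-158:7077 --total-executor-cores 2 --executor-memory 1G examples/src/main/python/pi.py 14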