Apache Spark is marketed as “lightning fast cluster computing”, and it lives up to that promise! It can do amazing things really quickly using a cluster of machines. So how do we assemble that cluster? Let’s say you are using a cloud service provider like Amazon Web Services. Do you need to spin up a bunch of instances yourself to form a “cluster”? Well, not really! Spark ships with a script that can launch a cluster for you, and you can control everything from one machine. You just run that script with your AWS credentials, and it will automatically launch and configure all the instances in the cluster. It’s beautiful! Let’s go ahead and see how to launch a cluster, shall we?
Creating a key pair
You need to create an Amazon EC2 key pair so that Spark can ssh into all the instances it launches. This is the “.pem” file that you use to log into your instances. If you don’t have one, go to your AWS console, click “Key Pairs” in the left sidebar, and create and download a key pair. Once you download it, set the permissions to 600 (so that only you can read and write the file) by running the following command:
$ chmod 600 filename.pem
We do this because ssh refuses to use a private key file that other users can read, so restricting the permissions is required before you can log in with it.
Setting the policy for IAM user
You need to set a policy for your IAM user so that Spark is allowed to launch instances. Go to your AWS console, then to “Your name (top bar) > Security Credentials”, and click “Users” in the left sidebar. Click your user name in the list and scroll down to “Managed Policies”. Now click “Attach Policy” (the blue button). You will be presented with a list of options; scroll down and pick “AmazonEC2FullAccess”. This gives the IAM user permission to launch new instances.
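If you prefer the command line, the same policy can be attached with the AWS CLI. This is a sketch assuming the AWS CLI is installed and configured, and that your IAM user is named “my-spark-user” (a placeholder; substitute your own user name):

```shell
# Attach the AmazonEC2FullAccess managed policy to the IAM user.
# "my-spark-user" is a placeholder; use your actual IAM user name.
aws iam attach-user-policy \
    --user-name my-spark-user \
    --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

# Verify that the policy is now attached
aws iam list-attached-user-policies --user-name my-spark-user
```

Either way (console or CLI), the end result is the same: the user's credentials can now be used to launch EC2 instances.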
Setting the environment variables
Go into the machine where you are running Spark. We have to set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your AWS access key ID and secret access key. You can get them in your AWS console by going to “Your name (top bar) > Security Credentials > Access Credentials”. Open up your “~/.bashrc” file and add these two lines at the end:
export AWS_ACCESS_KEY_ID=ABCDXYZ1234567890987
export AWS_SECRET_ACCESS_KEY=DaCbCpAwEs123456iJjKkLlMm9876JoSqRpQsTiL
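Changes to “~/.bashrc” only take effect in new shells, so for the current session you can export the variables directly and confirm they are visible to child processes such as the launch script. The key values below are the same placeholders as above; use your real credentials:

```shell
# Set the credentials for the current shell session (placeholder values);
# the lines in ~/.bashrc make them permanent for future sessions
export AWS_ACCESS_KEY_ID=ABCDXYZ1234567890987
export AWS_SECRET_ACCESS_KEY=DaCbCpAwEs123456iJjKkLlMm9876JoSqRpQsTiL

# Confirm the variables are set (prints the access key ID)
env | grep AWS_ACCESS_KEY_ID
```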
Launching the cluster
We are now ready to launch our cluster. Let’s launch a cluster with 1 master and 1 slave. You can launch a bigger cluster once you get the hang of it. Make sure you know what zone your AWS instances are located in. Go into your “spark-1.5.1/ec2” folder and run the following command:
$ ./spark-ec2 --slaves 1 --key-pair=mykeypair --identity-file=/path/to/mykeypair.pem --region=us-west-1 --zone=us-west-1a launch my-cluster
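The same script also manages the cluster after launch. A few commonly used actions, assuming the same key pair, region, and cluster name as in the launch command above:

```shell
# Log into the master node over ssh
./spark-ec2 --key-pair=mykeypair --identity-file=/path/to/mykeypair.pem \
    --region=us-west-1 login my-cluster

# Stop the cluster's instances without destroying them
./spark-ec2 --region=us-west-1 stop my-cluster

# Restart a previously stopped cluster
./spark-ec2 --key-pair=mykeypair --identity-file=/path/to/mykeypair.pem \
    --region=us-west-1 start my-cluster

# Terminate the cluster permanently (you will be asked to confirm)
./spark-ec2 --region=us-west-1 destroy my-cluster
```

Remember that stopped instances still incur storage charges, and “destroy” is irreversible, so double-check the cluster name before running it.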
If everything goes well, you will see 2 new instances in your AWS console named something like “my-cluster-master-i-4221aaff” and “my-cluster-slave-i-4221aaff”. You can go to http://master-ip:8080 (replace master-ip with the public IP address or DNS name of the master instance) to see the Spark web UI. You are all set!