Getting Started With Apache Spark In Python

In one of the previous blog posts, we discussed how to get Apache Spark up and running on your Ubuntu box. In this post, we will start exploring it. One of the best things about Spark is that it comes with a Python API that works like a charm! The API is also available in Java, Scala, and R. That pretty much covers the entire world of programming and data science! Spark’s shell provides a great way to analyze our data and work with it interactively. We are going to see how to interact with the Spark Python API in this post. By now, you should have Spark downloaded on your machine. Let’s go into the “spark-1.5.1” directory in your terminal and get started, shall we?

Starting with something simple

Let’s fire up the Spark Python shell:

$ ./bin/pyspark

This will print a message and open the shell. There is a “README.md” file located in the Spark directory you just downloaded. Let’s load it:

>>> my_file = sc.textFile("README.md")

It’s important at this point to understand a little bit about how Spark handles data. The primary abstraction in Spark is called a Resilient Distributed Dataset (RDD). It is a distributed collection of items that Spark uses to handle the data efficiently. In our example here, “my_file” is a new RDD that we just created. The concept of RDD is actually a deep topic that cannot be covered in just a couple of sentences. To learn more about RDDs, you can go through this nice tutorial by Matei Zaharia (creator of Apache Spark). It’s good to understand the basics of RDDs if you are planning on using Spark to do big data processing.
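As a quick illustration (just a sketch; the numbers here are arbitrary), an RDD doesn’t have to come from a file. You can also turn an in-memory Python collection into one with “sc.parallelize()”:

>>> numbers = sc.parallelize([3, 1, 4, 1, 5, 9])  # distribute a small Python list as an RDD

This gives you another RDD, this time backed by a small list instead of a file, and everything we do with “my_file” below works on it too.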

Now that we have loaded the file, let’s start with the customary line-counting operation:

>>> my_file.count()

You should see the number of lines printed on the terminal.

How do we interact with an RDD?

Let’s go ahead and understand a little bit about RDDs. To keep it simple for now, you can do two things with an RDD (there’s a quick sketch of the difference right after this list):

  • Take an “action” --> this processes the input RDD and returns a value
  • Apply a “transformation” --> this processes the input RDD and returns a new RDD
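Here’s a minimal sketch of that difference, reusing the “my_file” RDD we already loaded. The “map()” call is a transformation: it just describes a new RDD and nothing is computed yet (Spark evaluates transformations lazily). The “first()” call is an action: it actually runs the job and hands back a plain Python string.

>>> upper_lines = my_file.map(lambda line: line.upper())  # transformation: returns a new RDD, nothing runs yet
>>> upper_lines.first()  # action: triggers the computation and returns the first line in uppercase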

To demonstrate this, let’s count the number of lines with the word “the” in the file. We will be using lambda functions to do this. If you need a quick refresher on lambda functions along with filter/map/reduce functions, you should go through this blog post. Let’s apply a “transformation” to create an RDD of the lines containing the word “the”:

>>> lines_with_the = my_file.filter(lambda line: "the" in line)

Let’s take an “action” on this RDD to count the lines in it:

>>> lines_with_the.count()

You should see the number of lines printed. Just as a side note, you should be using tab autocomplete when you are in the Spark Python shell. You can autocomplete names of variables as well as actions and transformations you can apply to an RDD. If you don’t have it enabled, you can follow the steps in this blog post to enable it. Once you enable it, open up the Spark Python shell and load the RDD:

>>> my_file = sc.textFile("README.md")

Now, if you type “my” and hit “tab”, it will autocomplete the variable name. If you type “my_file.” and hit tab twice, it will display all the built-in actions and transformations you can apply to it. Pretty sweet, right? Okay, let’s move forward.

It’s time to do something slightly more complex

Let’s print the line that contains the word “the” and has the highest number of words. We can actually do it with a single line of code:

>>> my_file.filter(lambda line: "the" in line).map(lambda line: (line, len(line.split()))).reduce(lambda a, b: a if (a[1] > b[1]) else b)

Too complex? Okay let’s break it down. It’s actually simple and intuitive! We just chained a bunch of operators to get the output. The first one is:

my_file.filter() --> filters the input RDD and returns a subset of that RDD

Keeping that in mind:

my_file.filter(lambda line: "the" in line) --> returns an RDD of lines containing the word "the"

We’ve called this “lines_with_the”. Now the next operator is “map()”:

lines_with_the.map(lambda line: (line, len(line.split()))) --> returns an RDD where each element is a tuple containing the line and the number of words in that line

Let’s call this “line_lengths”. Now the final operator is “reduce()”:

line_lengths.reduce(lambda a, b: a if (a[1] > b[1]) else b) --> compares the tuples by their second element (the word count) and returns the tuple with the largest count

This is the tuple we want. You will see that it prints a tuple containing the line with the most words among the lines containing the word “the”, along with its word count. If you want to print the contents of an RDD, just use “collect()”:

>>> my_file.collect()

If you want to print the third element, then you can just use the list-indexing syntax as shown below:

>>> my_file.collect()[2]
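One small caveat worth knowing: “collect()” pulls the entire RDD back to the driver, which can get expensive for big datasets. If you just want to peek at a few elements, “take()” is a lighter option:

>>> my_file.take(3)  # returns the first three lines as a Python list, without collecting everything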

You can keep exploring this to learn more about how you can chain operators together to make Spark do amazing things in a distributed way.
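If you want something to chew on, here is a rough sketch of the classic word count, chaining “flatMap()”, “map()”, and “reduceByKey()” (all standard RDD operations). Treat it as a starting point rather than a polished recipe:

>>> words = my_file.flatMap(lambda line: line.split())  # split every line into individual words
>>> counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)  # add up a 1 for every occurrence of each word
>>> counts.take(5)  # peek at a few (word, count) pairs

Notice that this is just another chain of transformations followed by an action, exactly like the examples above.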
