Performing Windowed Computations On Streaming Data Using Spark In Python

We deal with real-time data all the time. If you look at analytics dashboards, you can see how they perform computations and tell us what happened in the last 60 minutes or maybe the last 7 hours. They are dealing with terabytes of data and yet they can process all of it in real time. These insights are extremely valuable because you can take the right actions if you know what’s happening. If you have a shopping website, you need to know what happened in the last few hours so that you can boost your sales. Are there a lot of visitors from France? Can I organize a quick French-themed promotion to increase my sales during peak hours? The answers to all these questions lie deep within your data. Spark Streaming is amazing at these things! So how do we do windowed computations in Spark? How can we process this data in real time?   Continue reading
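Spark’s actual windowing API is covered in the post itself, but the core idea of a windowed computation — counting events per key within a window that slides over time — can be sketched in plain Python. The function name, the visitor data, and the fixed window boundaries below are illustrative assumptions, not Spark’s API:

```python
def windowed_counts(events, window, slide, start, end):
    """Count events per key in sliding windows of length `window`,
    advancing by `slide` time units, over [start, end]."""
    results = []
    t = start + window          # first window covers [start, start + window)
    while t <= end:
        counts = {}
        for ts, key in events:
            if t - window <= ts < t:
                counts[key] = counts.get(key, 0) + 1
        results.append((t, counts))
        t += slide
    return results

# Country codes of recent visitors, tagged with arrival times:
visits = [(0, "FR"), (1, "US"), (2, "FR"), (3, "FR"), (4, "US")]
for t, counts in windowed_counts(visits, window=2, slide=2, start=0, end=6):
    print(t, counts)
# 2 {'FR': 1, 'US': 1}
# 4 {'FR': 2}
# 6 {'US': 1}
```

Spark Streaming does the same kind of thing, except distributed across a cluster and over data that never stops arriving.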

Analyzing Real-time Data With Spark Streaming In Python

There is a lot of data being generated in today’s digital world, so there is a high demand for real-time data analytics. This data usually comes in bits and pieces from many different sources. It can come in various forms like words, images, numbers, and so on. Twitter is a good example of words being generated in real time. We also have websites where statistics like the number of visitors, page views, and so on are being generated in real time. There is so much data that it is not very useful in its raw form. We need to process it and extract insights from it so that it becomes useful. This is where Spark Streaming comes into the picture! It is exceptionally good at processing real-time data and it is highly scalable. It can process enormous amounts of data in real time without skipping a beat. So how exactly does Spark do it? How do we use it?   Continue reading
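The model Spark Streaming uses is covered in the post, but the gist — chop the incoming stream into small micro-batches and process each one — can be illustrated in plain Python. The helper names and sample records here are made up for illustration, not part of Spark’s API:

```python
def micro_batches(records, batch_size):
    """Chop a stream of records into fixed-size micro-batches."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]

def word_counts(batch):
    """Count word occurrences within one micro-batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["spark streams data", "spark is fast", "data data"]
for batch in micro_batches(stream, batch_size=2):
    print(word_counts(batch))
# {'spark': 2, 'streams': 1, 'data': 1, 'is': 1, 'fast': 1}
# {'data': 2}
```

Spark Streaming applies the same batching idea, but the batches are defined by time intervals and the per-batch work is distributed across a cluster.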

Launching A Spark Standalone Cluster

In the previous blog post, we saw how to start a Spark cluster on EC2 using the inbuilt launch scripts. This is good if you want to get something up and running quickly, but it won’t allow fine-grained control over your cluster. A lot of times, we would want to customize the machines that we spin up. Let’s say that you want to use different types of machines to handle production level traffic in different regions. Maybe you are not on EC2 and you want to launch some machines in your cluster. How would you do it? This is the reason we have Spark Standalone mode. Using this method, we can manually launch any number of machines independently in our private cluster and make them listen to our master machine. It gives us a lot of flexibility! Let’s go ahead and see how to do it, shall we?   Continue reading

How To Launch A Spark Cluster On Amazon EC2

Apache Spark is marketed as “lightning fast cluster computing” and it stands true to its word! It can do amazing things really quickly using a cluster of machines. So how do we assemble that cluster? Let’s say you are using a cloud service provider like Amazon Web Services. Do we need to spin up a bunch of instances ourselves to form a “cluster”? Well, not really! Spark can launch a cluster by itself and you can control everything from one machine. You just need to log into your main instance and Spark will automatically launch all the instances in the cluster for you. It’s beautiful! Let’s go ahead and see how to launch a cluster, shall we?   Continue reading

Getting Started With Apache Spark In Python

In one of the previous blog posts, we discussed how to get Apache Spark up and running on your Ubuntu box. In this post, we will start exploring it. One of the best things about Spark is that it comes with a Python API that works like a charm! The API is also available in Java, Scala, and R. That pretty much covers the entire world of programming and data science! Spark’s shell provides a great way to analyze our data and work with it interactively. We are going to see how to interact with the Spark Python API in this post. You should already have Spark downloaded on your machine. Let’s go into the “spark-1.5.1” directory on your terminal and get started, shall we?   Continue reading

Understanding Filter, Map, And Reduce In Python

Even though a lot of people use Python in an object-oriented style, it has several functions that enable functional programming. For those of you who don’t know, functional programming is a programming paradigm based on lambda calculus that treats computation as an evaluation of mathematical functions. Some of the prominent functional programming languages include Scala, Haskell, Clojure, and so on. You should go through this nice article on functional programming that explains it in layman’s terms. Coming back to the topic at hand, Python provides features like lambda, filter, map, and reduce that can basically cover most of what you would need to operate on data. Let’s go ahead and play with them to understand their awesomeness, shall we?   Continue reading
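As a quick preview of what the post covers, here are the three functions in action on a list of numbers (assuming Python 3, where reduce lives in functools):

```python
from functools import reduce  # in Python 3, reduce moved into functools

nums = [1, 2, 3, 4, 5, 6]

# filter keeps only the elements for which the lambda returns True
evens = list(filter(lambda x: x % 2 == 0, nums))

# map applies the lambda to every element
squares = list(map(lambda x: x * x, nums))

# reduce folds the list into a single value, here a running sum
total = reduce(lambda acc, x: acc + x, nums)

print(evens)    # [2, 4, 6]
print(squares)  # [1, 4, 9, 16, 25, 36]
print(total)    # 21
```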

How To Install Apache Spark On Ubuntu

There’s so much data being generated in today’s world that it’s mind boggling, and we need platforms and frameworks that can handle it. This field of study is called Big Data Analysis. With so much data lying around, often ranging in petabytes and exabytes, we need super powerful systems to process it. Not only that, we need to do it with high efficiency. If you try to do it using your regular ways, you will never be able to do anything in time, let alone doing it in real-time. This is where Apache Spark comes into the picture. It is an open source big data processing framework that can process massive amounts of data at high speed using cluster computing. Let’s see how we can install it on Ubuntu.   Continue reading

Enabling Tab Autocomplete In Python Shell

It’s fun to play around with Python. One of its best features is the interactive shell where we can experiment all we want. Let’s say you open up a shell, declare a bunch of variables and want to operate on them. You don’t want to type the full variable names over and over again, right? Also, it’s difficult to remember the full names of all the inbuilt methods and functions as well. Since we are playing around with the same variables and inbuilt functions, it would be nice to have an autocomplete feature that can complete the variable and function names for us. Fortunately, Python provides that nifty little feature! Let’s see how we can enable it here.   Continue reading
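As a preview, the feature is built on the standard library’s readline and rlcompleter modules (readline ships with Python on Linux and macOS but not in the stock Windows build). A minimal setup looks like this; the `~/.pythonrc` file name is just a convention:

```python
# Typically saved to a file like ~/.pythonrc and exported via the
# PYTHONSTARTUP environment variable so every interactive shell picks it up.
import readline
import rlcompleter

readline.parse_and_bind("tab: complete")

# rlcompleter does the actual matching; we can poke at it directly:
my_long_variable = 42
completer = rlcompleter.Completer(globals())
print(completer.complete("my_lo", 0))  # my_long_variable
```

With this in place, typing `my_lo` and hitting Tab in the shell expands to the full variable name.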

How To Add Swap Space On Ubuntu

Whenever you are building an application that’s memory intensive, you are bound to run into memory issues. Those out of memory errors are painful to deal with, especially when they happen during production. Before putting your code on your server, you need to make sure that it can handle the application’s memory requirements. But even if you are careful, something might still go wrong and you might end up running into memory issues. One of the easiest ways to deal with this is by adding some swap space. Now how will it help our case? How can we use it on Ubuntu?   Continue reading

What Is External Sorting?

Sorting is one of the most common things we do in programming. We are given a bunch of numbers and we want to arrange them according to some rule. Let’s say we want to arrange them in ascending order. To sort these numbers, people tend to use a sorting algorithm that takes place entirely within the memory of a computer. The memory we are talking about is the RAM. Here, we take all the numbers and store them in memory so that we can sort them. This is possible only when the amount of data is small enough to fit in memory. What if we have a hundred trillion numbers to be sorted? It’s too big to be stored in the computer’s memory. How do we do it?   Continue reading
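The classic answer, which the post explores, is an external merge sort: sort memory-sized chunks, spill each sorted run to disk, then merge the runs. Here is a toy sketch in Python; the helper names and the tiny chunk size are illustrative, and a real implementation would stream its input and output rather than hold lists:

```python
import heapq
import os
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        for n in sorted_chunk:
            f.write("%d\n" % n)
    return path

def external_sort(numbers, chunk_size):
    """Sort integers that won't all fit in memory: sort chunk_size-sized
    runs, spill each run to disk, then k-way merge the runs."""
    run_files = []
    chunk = []
    for n in numbers:
        chunk.append(n)
        if len(chunk) == chunk_size:
            run_files.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        run_files.append(_spill(sorted(chunk)))

    # heapq.merge pulls lazily from each sorted run, so only one
    # element per run needs to sit in memory at any moment.
    run_streams = [open(path) for path in run_files]
    merged = list(heapq.merge(*(map(int, f) for f in run_streams)))

    for f in run_streams:
        f.close()
    for path in run_files:
        os.remove(path)
    return merged

print(external_sort([5, 3, 8, 1, 9, 2, 7], chunk_size=3))  # [1, 2, 3, 5, 7, 8, 9]
```

Only one element per run needs to be resident during the merge, which is what lets the total data size exceed RAM.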