Tuesday, August 9, 2016

Data Workflow Management in Big Data Analytics

Workflow Management in Big Data Analytics

So now you have this big, powerful analytics cluster of 500+ nodes, and suddenly lots of teams around your organization are ready to attack it with heavy jobs.

You need a way to schedule and manage these jobs in the data pipeline, and that is where data workflow management tools like Airflow and Node-RED come into the picture (a minimal Airflow DAG sketch is shown below).

Airflow

Node-RED
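
To make the idea concrete, here is a minimal sketch of an Airflow DAG that chains two pipeline steps. The DAG name, schedule, and bash commands are placeholders, not a real pipeline; it only illustrates how Airflow expresses task dependencies.

    # Minimal Airflow DAG sketch; dag_id, schedule, and commands are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="daily_analytics_pipeline",
        start_date=datetime(2016, 8, 1),
        schedule_interval="@daily",  # run once per day
    )

    # Step 1: pull raw events into the cluster.
    extract = BashOperator(
        task_id="extract_events",
        bash_command="echo 'pull raw events into HDFS'",
        dag=dag,
    )

    # Step 2: run the heavy Spark aggregation job after extraction succeeds.
    aggregate = BashOperator(
        task_id="aggregate_with_spark",
        bash_command="echo 'spark-submit aggregate_job.py'",
        dag=dag,
    )

    extract >> aggregate  # aggregate runs only after extract finishes

The scheduler then takes care of running the DAG on time, retrying failed tasks, and showing the pipeline state, which is exactly the kind of coordination a busy 500+ node cluster needs.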


Friday, August 5, 2016

Messaging Queue Systems - Kafka, Mesos, RabbitMQ, ZeroMQ, Apache ActiveMQ, OpenMP


  1. Kafka (a minimal producer/consumer sketch follows this list)
    • Getting started - http://blog.antlypls.com/blog/2015/10/05/getting-started-with-spark-streaming-using-docker/
  2. Mesos (strictly a cluster manager rather than a message queue, though it often hosts Kafka clusters)
  3. RabbitMQ
  4. ZeroMQ
  5. Apache ActiveMQ
  6. OpenMP (a shared-memory parallel-programming API rather than a message queue)
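
Since Kafka is the one most teams start with, here is a minimal producer/consumer sketch using the kafka-python package (my choice of client library; any Kafka client works). It assumes a broker at localhost:9092 and a topic named "events", both placeholders.

    # Minimal Kafka sketch with kafka-python (pip install kafka-python).
    # Assumes a broker at localhost:9092 and a topic named "events".
    from kafka import KafkaProducer, KafkaConsumer

    # Produce a few messages.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(3):
        producer.send("events", value=("event-%d" % i).encode("utf-8"))
    producer.flush()

    # Read them back; stop after 5 seconds of inactivity.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
    )
    for message in consumer:
        print(message.topic, message.offset, message.value)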

Thursday, July 28, 2016

Connect to Cloudant database from SparkR

How to Connect to Cloudant Database from SparkR kernel



Below I will show how to do it from Bluemix, but it applies to a Jupyter Notebook running in any environment.

Connecting to Cloudant from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account on Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/
  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)
  3. Now create a notebook with SparkR as the language.
    • The Spark context needs to know which driver to use to connect to the Cloudant database. In the Bluemix Spark service environment the spark-cloudant driver is loaded by default.
    • https://github.com/cloudant-labs/spark-cloudant
    • If you are running in a different environment, you can use the prebuilt binary:
    • https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar
    • For example, use %AddJar -f https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar to add it to your Spark kernel.
  4. Make sure the spark-cloudant connector is available in your Spark environment.
  5. Set three configuration parameters on your Spark context:
    • "cloudant.host" = "ACCOUNT.cloudant.com"
    • "cloudant.username" = "USERNAME"
    • "cloudant.password" = "PASSWORD"

      In SparkR, you pass these to all the executors through the sparkEnv argument of sparkR.init:
      sc <- sparkR.init(sparkEnv = list(
        "cloudant.host" = "c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix.cloudant.com",
        "cloudant.username" = "c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix",
        "cloudant.password" = "XXXXXXXXXXXXXXXXXXXX"))
       
       
      Once you execute the above, your Spark context is ready to use the cloudant connector.
      All you need to do is specify com.cloudant.spark as the source when reading (database below is the name of your Cloudant database):
       
      people <- read.df(sqlContext, database, source = "com.cloudant.spark", header = "true", inferSchema = "true")
      
      
      I have the complete notebook published on this GitHub repo. Feel free to use it.

Friday, June 17, 2016

Difference Between Spark and Hadoop MapReduce

Difference Between Spark and Hadoop


  1. Performance
    • Spark: Iterative computations are performed in memory; a mapper function simply transforms one RDD into another RDD, which saves disk I/O and network I/O and improves performance.
    • Hadoop MapReduce: The Map and Reduce phases force every mapper/reducer to write its output to disk, and the next mapper/reducer then reads it back, incurring disk I/O and network I/O and adding latency.
  2. Programming Languages
    • Spark: Scala, Java, Python, R
    • Hadoop MapReduce: Java
  3. Basic Unit of Data
    • Spark: RDD - Resilient Distributed Dataset
    • Hadoop MapReduce: Key-value tuples
  4. Lines of Code for WordCount
    • Spark: as few as 6 lines of Python code (refer here; a sketch follows this list)
    • Hadoop MapReduce: around 73 lines of Java code (refer here)
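
To back up the word count comparison, here is one way the PySpark version can look; the input and output paths are placeholders, and in a notebook the SparkContext sc is usually created for you.

    # PySpark word count sketch; paths are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")  # already provided in most notebooks
    counts = (sc.textFile("hdfs:///data/input.txt")    # read the input file
                .flatMap(lambda line: line.split())    # split lines into words
                .map(lambda word: (word, 1))           # emit (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))      # sum the counts per word
    counts.saveAsTextFile("hdfs:///data/output")       # write results to HDFS

The whole job is the handful of transformation lines in the middle, versus the mapper class, reducer class, and driver boilerplate that the Java MapReduce version needs.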

Monday, June 13, 2016

Spark 2.0 is out


Spark Summit East Keynote: Apache Spark 2.0


How do you get your hands on Spark 2.0:
1. Databricks Community Edition
2. Download and set it up


Major features:
  1.  Tungsten Phase 2 speedups of 5-10x
  2. Structured Streaming real-time engine on SQL/DataFrames
  3. Unifying Datasets and DataFrames (a quick PySpark sketch of the unified entry point follows this list)
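
As a quick taste of points 2 and 3, here is a sketch of the Spark 2.0 SparkSession entry point in PySpark; the file path, host, and port are placeholders, not part of any official example.

    # Spark 2.0 sketch: one SparkSession for DataFrames and Structured Streaming.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark2-demo").getOrCreate()

    # Batch: DataFrames come from the single SparkSession entry point.
    df = spark.read.json("people.json")      # placeholder input file
    df.filter(df.age > 21).show()

    # Streaming: the same DataFrame API over an unbounded socket source.
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())
    query = (lines.writeStream
                  .format("console")
                  .outputMode("append")
                  .start())
    query.awaitTermination()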