Friday, June 17, 2016

Difference Between Spark and Hadoop MapReduce

Difference Between Spark and Hadoop


Difference: Spark vs. Hadoop MapReduce
1. Performance
   Spark: Iterative computations are performed in memory. Mapper functions simply transform one RDD into another RDD, which saves disk I/O and network I/O and improves performance.
   Hadoop MapReduce: In the map and reduce phases, every mapper/reducer writes its data to disk and the next mapper/reducer reads it back, which adds disk I/O and network I/O and causes latency.
2. Programming languages
   Spark: Scala, Java, Python, R
   Hadoop MapReduce: Java
3. Basic unit of data
   Spark: RDD (Resilient Distributed Dataset)
   Hadoop MapReduce: Tuples (key-value pairs)
4. Lines of code for WordCount
   Spark: as few as 6 in Python (refer here)
   Hadoop MapReduce: as few as 73 in Java (refer here)
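
To give a feel for why WordCount can be so short in Python, here is a minimal sketch of the same three steps in plain Python, with no cluster involved. It is not Spark code: the function name `word_count` and the sample lines are made up for illustration, and the comments only name the RDD operations (`flatMap`, `map`, `reduceByKey`) that each step roughly corresponds to in PySpark.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words in an iterable of text lines, entirely in memory."""
    # Step 1 (roughly flatMap in PySpark): split every line into words.
    words = chain.from_iterable(line.split() for line in lines)
    # Step 2 (roughly map): pair each word with an initial count of 1.
    pairs = ((word, 1) for word in words)
    # Step 3 (roughly reduceByKey): sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical sample input, standing in for a text file on a cluster.
sample = ["spark is fast", "hadoop is batch", "spark is in memory"]
result = word_count(sample)
print(result)
```

Each step hands its output directly to the next one without writing anything to disk, which is the same idea as chaining RDD transformations in Spark. In Hadoop MapReduce the map output would be written out and read back before the reduce step runs.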

Monday, June 13, 2016

Spark 2.0 is out


Spark Summit East Keynote: Apache Spark 2.0


How do you get your hands on Spark 2.0?
1. Databricks Community Edition
2. Download and set it up


Major features:
1. Tungsten Phase 2: speedups of 5-10x
2. Structured Streaming: a real-time engine built on SQL/DataFrames
3. Unification of Datasets and DataFrames