Monday, June 13, 2016

Spark 2.0 is out


Spark Summit East Keynote: Apache Spark 2.0


How do you get your hands on Spark 2.0:-
1. Databricks Community Edition
2. Download and set it up


Major features:-
  1. Tungsten Phase 2: speedups of 5-10x
  2. Structured Streaming: a real-time engine on SQL/DataFrames
  3. Unified Datasets and DataFrames (see the sketch after this list)
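
A quick taste of point 3: in Spark 2.0, SparkSession becomes the single entry point and a DataFrame is just Dataset[Row]. Below is a minimal Scala sketch, assuming a local Spark 2.0 build; the Person case class and the sample rows are made up for illustration.

// Case class so the untyped DataFrame can be viewed as a typed Dataset.
case class Person(name: String, age: Long)

object Spark2UnifiedApi {
  import org.apache.spark.sql.SparkSession

  def main(args: Array[String]): Unit = {
    // SparkSession is the new unified entry point in 2.0,
    // replacing the separate SQLContext/HiveContext.
    val spark = SparkSession.builder()
      .appName("Spark2UnifiedApi")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A DataFrame in 2.0 is just Dataset[Row]...
    val df = Seq(("Alice", 30L), ("Bob", 25L)).toDF("name", "age")

    // ...and .as[T] turns it into a strongly typed Dataset[Person].
    val people = df.as[Person]
    people.filter(_.age > 26).show()

    spark.stop()
  }
}

Structured Streaming (point 2) exposes this same DataFrame/Dataset API over unbounded data.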

Thursday, June 9, 2016

Running your first R notebook on IBM Bluemix Apache Spark Service


The IBM Bluemix Apache Spark Service has introduced an R tech preview that lets users run R programs on a Spark cluster.
https://developer.ibm.com/clouddataservices/docs/spark/technical-previews/r-in-jupyter-notebooks/
So how do you get started with an R notebook on Spark?

You will need to create a new instance of the service, since the tech preview was introduced in May 2016. Please check it out.
I have a simple Pi calculator example here, if you just want to import it and give the service a try:- https://github.com/charles2588/bluemixsparknotebooks/raw/master/R/Pi_Bluemix.ipynb
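
The linked notebook is in R, but the computation is the classic Monte Carlo Pi estimate. Here is the same idea as a minimal Scala sketch, assuming the predefined SparkContext `sc` that these notebooks provide; the sample count n is an arbitrary choice.

// Monte Carlo Pi, the same idea the linked R notebook demonstrates.
// Assumes the notebook's predefined SparkContext `sc`.
val n = 1000000  // number of random points; arbitrary choice

val inside = sc.parallelize(1 to n).filter { _ =>
  val x = math.random
  val y = math.random
  x * x + y * y < 1  // point falls inside the unit quarter circle
}.count()

// The fraction of points inside the quarter circle approximates Pi/4.
println(s"Pi is roughly ${4.0 * inside / n}")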

Free Beta Data Science Tools with Spark

Below are links to beta programs / community editions that let you test your Spark programs on hosted Spark clusters without having to set up anything yourself.

IBM

Sign up for IBM Data Science Experience (beta wait-list).
http://datascience.ibm.com/


Databricks

Sign up for Community Edition.
This gives you a free Spark instance (beta wait-list).

https://databricks.com/try-databricks

Thursday, April 7, 2016

PANCAKE STACK -- New Data Science Stack

  1. Presto
  2. Arrow
  3. NiFi
  4. Cassandra
  5. Airflow
  6. Kafka
  7. ElasticSearch
  8. Apache Spark
  9. TensorFlow
  10. Algebird
  11. CoreNLP
  12. Kibana

Architecture:- (diagram from the original post not reproduced here)


Thursday, March 17, 2016

Connecting to MongoDB from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)

  3. Now create a notebook with Scala as the language.
    1. Add the unityJDBC jar, which contains the MongoDB JDBC driver.
      %AddJar https://github.com/charles2588/SparkNotebooksJars/raw/master/unityjdbc.jar

    2. Add the Mongo Java Driver jar, which unityJDBC needs.
      %AddJar https://github.com/charles2588/SparkNotebooksJars/raw/master/mongo-java-driver-2.13.3.jar

    3. Test the import below.
      import mongodb.jdbc.MongoDriver

    4. Import the DataFrame and SQLContext classes.
      import org.apache.spark.sql.{DataFrame, SQLContext}
    5. Replace url with your MongoDB URL, dbtable with the name of the table (collection) for which you want to create a DataFrame, and user and password with the credentials for your MongoDB server.
      val url = "jdbc:mongo://ds045252.mlab.com:45252/samplemongodb"
      val dbtable = "Photos"
      val user = "charles2588"
      val password = "*****"
      val options = scala.collection.Map("url" -> url, "driver" -> "mongodb.jdbc.MongoDriver", "dbtable" -> dbtable, "user" -> user, "password" -> password)
      

    6. Now create a new SQLContext from your SparkContext, which has the MongoDB driver loaded.
      val sqlContext = new SQLContext(sc)

    7. Create a DataFrameReader from your SQLContext for your table.
      val dataFrameReader = sqlContext.read.format("jdbc").options(options)

    8. Call the load method to create a DataFrame for your table.
      val tableDataFrame = dataFrameReader.load()

    9. Call the show() method to display the table contents in the notebook.
      tableDataFrame.show()

  4. You have successfully created a DataFrame from MongoDB; now you can process it further as needed (the consolidated sketch below puts the sub-steps together).
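
Putting sub-steps 4-9 together, here is the whole flow as one sketch. The URL, table, user, and password are the placeholder values from the steps above, so swap in your own, and the two %AddJar cells must already have run.

import org.apache.spark.sql.{DataFrame, SQLContext}

// Connection options; replace these placeholder values with your own.
val url = "jdbc:mongo://ds045252.mlab.com:45252/samplemongodb"
val dbtable = "Photos"
val user = "charles2588"
val password = "*****"
val options = scala.collection.Map(
  "url" -> url,
  "driver" -> "mongodb.jdbc.MongoDriver",
  "dbtable" -> dbtable,
  "user" -> user,
  "password" -> password)

// `sc` is the SparkContext the notebook provides; the unityJDBC and
// Mongo Java Driver jars must already be on the classpath (%AddJar above).
val sqlContext = new SQLContext(sc)

// Build a JDBC reader over the options and load the collection as a DataFrame.
val tableDataFrame: DataFrame = sqlContext.read.format("jdbc").options(options).load()
tableDataFrame.show()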