Thursday, July 28, 2016

Connect to Cloudant database from SparkR

How to Connect to Cloudant Database from SparkR kernel



Below I show how to do this from Bluemix, but the same steps apply to a Jupyter Notebook running in any environment.

Connecting to Cloudant from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)
  3. Create a notebook with SparkR as the language.
    • The Spark context needs to know which driver to use to connect to the Cloudant database. In the Bluemix Spark service environment, the driver is loaded by default.
    • The driver is the spark-cloudant connector: https://github.com/cloudant-labs/spark-cloudant
    • If you are in a different environment, use the binary release: https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar
    • For example, use %AddJar -f https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar to add it to your Spark session.
  4. Once the spark-cloudant connector is available in Spark, you need to set three configuration parameters on your Spark context:
    • "cloudant.host", "ACCOUNT.cloudant.com"
      "cloudant.username", "USERNAME"
      "cloudant.password", "PASSWORD"
       
      In SparkR, pass these settings through the sparkEnvir argument of sparkR.init so that they reach all the executors.
      sc <- sparkR.init(sparkEnvir = list(
        "cloudant.host" = "c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix.cloudant.com",
        "cloudant.username" = "c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix",
        "cloudant.password" = "XXXXXXXXXXXXXXXXXXXX"))
       
       
      Once you execute the line above, your Spark context is ready to use the Cloudant connector.
      All you need to do is specify com.cloudant.spark as the source when reading. Here, database is a variable holding the name of your Cloudant database.
       
      people <- read.df(sqlContext, database, header = 'true', source = "com.cloudant.spark", inferSchema = 'true')
      
      
      I have published the complete notebook on this GitHub repo. Feel free to use it.
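
For reference, the steps above can be condensed into a single notebook cell. This is only a minimal sketch, assuming the SparkR 1.6 API and that the spark-cloudant connector is already on the classpath. The environment variable names (CLOUDANT_HOST, CLOUDANT_USERNAME, CLOUDANT_PASSWORD) and the database name "my_database" are placeholders I made up for illustration, not something the connector requires. Reading the credentials from the environment simply avoids pasting the password into the notebook.

```r
# Sketch only: assumes SparkR 1.6 and spark-cloudant already on the classpath.
library(SparkR)

# Hypothetical environment variable names, chosen for this example.
cloudant_conf <- list(
  "cloudant.host"     = Sys.getenv("CLOUDANT_HOST"),
  "cloudant.username" = Sys.getenv("CLOUDANT_USERNAME"),
  "cloudant.password" = Sys.getenv("CLOUDANT_PASSWORD")
)

# Stop early if any of the three settings is empty.
missing_keys <- names(cloudant_conf)[!nzchar(unlist(cloudant_conf))]
if (length(missing_keys) > 0) {
  stop("Missing Cloudant settings: ", paste(missing_keys, collapse = ", "))
}

# Create the Spark context with the Cloudant settings, then a SQL context on top of it.
sc <- sparkR.init(sparkEnvir = cloudant_conf)
sqlContext <- sparkRSQL.init(sc)

# Placeholder database name; replace with your own Cloudant database.
database <- "my_database"

# Load the database as a SparkR DataFrame through the connector.
people <- read.df(sqlContext, database, source = "com.cloudant.spark")

printSchema(people)
head(people)
```

If the notebook session already has a running Spark context, sparkR.init may reuse it instead of applying the new settings. In that case, calling sparkR.stop() first and then re-running the cell is likely needed.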