
Thursday, July 28, 2016

Connect to a Cloudant Database from SparkR




Below I will show how to do it from Bluemix, but the same steps apply to a Jupyter Notebook running in any environment.

Connecting to Cloudant from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)
  3. Now create a notebook with SparkR as the language.
    • The Spark context needs to know which driver to use to connect to the Cloudant database. In the Bluemix Spark service environment, the spark-cloudant driver (https://github.com/cloudant-labs/spark-cloudant) is loaded by default.
    • If you are running in a different environment, you can use the released binary instead: https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar
    • For example, run %AddJar -f https://github.com/cloudant-labs/spark-cloudant/releases/download/v1.6.3/cloudant-spark-v1.6.3-125.jar to add it to your Spark environment.
  4. Once you have the spark-cloudant connector in your Spark environment, you need to set three configuration parameters on your Spark context:
    • "cloudant.host" = "ACCOUNT.cloudant.com"
    • "cloudant.username" = "USERNAME"
    • "cloudant.password" = "PASSWORD"

      In SparkR, pass these through the sparkEnvir argument of sparkR.init so that they are available to all the executors:
      sc <- sparkR.init(sparkEnvir = list(
        "cloudant.host" = "c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix.cloudant.com",
        "cloudant.username" = "c8dca934-d2a4-4dcc-9123-2189ce9f5812-bluemix",
        "cloudant.password" = "XXXXXXXXXXXXXXXXXXXX"))
       
       
      Once you execute the above, your Spark context is ready to use the cloudant connector.
      All you need to do is create an SQLContext from the new Spark context and specify com.cloudant.spark as the source when reading:

      sqlContext <- sparkRSQL.init(sc)
      database <- "YOUR_DATABASE_NAME"  # the name of the Cloudant database to read
      people <- read.df(sqlContext, database, source = "com.cloudant.spark", header = 'true', inferSchema = 'true')
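
      Once the data frame is created, you can work with it like any other SparkR DataFrame. A minimal sketch of what that might look like (the temporary table name "people" is just an illustration):

      printSchema(people)
      head(people)
      registerTempTable(people, "people")
      head(sql(sqlContext, "SELECT * FROM people LIMIT 10"))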
      
      
      I have the complete notebook published on this GitHub repo. Feel free to use it.

Sunday, January 31, 2016

Connecting to Postgres from IBM Bluemix - Jupyter Notebooks on Spark


  1. Create an account in Bluemix (IBM offers a 30-day free trial) - https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html)

  3. Now create a notebook with Scala as the language.
    1. Download the Postgres JDBC jar using the %AddJar magic, which adds a third-party jar to the kernel:
      %AddJar -f https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar

    2. Import the SparkConf and SparkContext classes:
      import org.apache.spark.{SparkConf, SparkContext}

    3. The first statement creates a SparkConf configuration object from Spark's initial context "sc".
      The conf.setJars call is the magic statement that specifies all the jars to be added to the new SparkContext we are about to create; because we downloaded the Postgres driver jar with %AddJar, it is on the kernel's class loader and gets included. (Simply copy and paste the statement, as it is too complex to modify :))
      val conf = sc.getConf
      conf.setJars(ClassLoader.getSystemClassLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSet.toSeq ++
        kernel.interpreter.classLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSeq)
      conf.set("spark.driver.allowMultipleContexts", "true")
      conf.set("spark.master", "local[*]")
      val scPostgres = new SparkContext(conf)

    4. Import the SQLContext class for the DataFrame work that follows:
      import org.apache.spark.sql.{SQLContext}

    5. Replace url with your Postgres URL, dbtable with the name of the table you want a DataFrame for, and user and password with the credentials for your Postgres database.
      Note on the url: you can remove the sslmode argument, depending on the configuration of your Postgres server.
      val url = "jdbc:postgresql://ec2-75-101-163-171.compute-1.amazonaws.com:5432/d7vad26hel3q5l?sslmode=require"
      val dbtable = "public.test"
      val user = ""
      val password = ""
      val options = scala.collection.Map("url" -> url, "driver" -> "org.postgresql.Driver", "dbtable" -> dbtable, "user" -> user, "password" -> password)

    6. Now create a new SQLContext from your new SparkContext, which has the Postgres driver loaded:
      val ncsqlContext = new SQLContext(scPostgres)

    7. Create a DataFrameReader from your SQLContext for your table:
      val dataFrameReader = ncsqlContext.read.format("jdbc").options(options)

    8. Call the load method to create a DataFrame for your table:
      val tableDataFrame = dataFrameReader.load()

    9. Call show() to display the table contents in the notebook:
      tableDataFrame.show()


  4. You have successfully created a DataFrame; see the short usage sketch below.
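
Once the DataFrame exists, you can query it like any other Spark DataFrame. A minimal sketch (the temporary table name "test" is just an illustration):

  tableDataFrame.registerTempTable("test")
  val results = ncsqlContext.sql("SELECT * FROM test LIMIT 10")
  results.show()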