Sunday, January 31, 2016

Connecting to Postgres from IBM Bluemix - Jupyter Notebooks on Spark

  1. Create a Bluemix account (IBM offers a 30-day free trial): https://console.ng.bluemix.net/registration/

  2. Create a Spark service (https://www.ng.bluemix.net/docs/services/AnalyticsforApacheSpark/index.html).

  3. Create a notebook with Scala as the language, then run the following steps in it.
    1. Download the Postgres JDBC driver with the %AddJar magic, which adds a third-party jar to the notebook.
      %AddJar -f https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar

    2. Import the SparkConf and SparkContext classes.
      import org.apache.spark.{SparkConf, SparkContext}

    3. Create a new SparkContext that has the driver jar on its classpath.
      The first statement gets a SparkConf configuration object from the notebook's initial Spark context, "sc".
      conf.setJars then specifies every jar to add to the new SparkContext: the jars on the system class loader plus the jars on the kernel interpreter's class loader, which include the Postgres driver downloaded in step 1. Copy the statements as they are; they are awkward to modify.
      val conf = sc.getConf
      conf.setJars(
        ClassLoader.getSystemClassLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSet.toSeq ++
        kernel.interpreter.classLoader.asInstanceOf[java.net.URLClassLoader].getURLs.map(_.toString).toSeq)
      conf.set("spark.driver.allowMultipleContexts", "true")
      conf.set("spark.master", "local[*]")
      val scPostgres = new SparkContext(conf)

    4. Import the SQLContext class, which is needed to create DataFrames.
      import org.apache.spark.sql.SQLContext

    5. Replace url with your Postgres JDBC URL, dbtable with the name of the table you want to load as a DataFrame, and user and password with the credentials for your Postgres database.
      Note on url: you can remove the sslmode argument, depending on how the Postgres server is configured.
      val url = "jdbc:postgresql://ec2-75-101-163-171.compute-1.amazonaws.com:5432/d7vad26hel3q5l?sslmode=require"
      val dbtable = "public.test"
      val user = ""
      val password = ""
      val options = scala.collection.Map(
        "url" -> url,
        "driver" -> "org.postgresql.Driver",
        "dbtable" -> dbtable,
        "user" -> user,
        "password" -> password)

    6. Create a new SQLContext from the new SparkContext, which has the Postgres driver loaded.
      val ncsqlContext = new SQLContext(scPostgres)

    7. Create a DataFrameReader for your table from the SQLContext.
      val dataFrameReader = ncsqlContext.read.format("jdbc").options(options)

    8. Call the load method to create a DataFrame for your table.
      val tableDataFrame = dataFrameReader.load()

    9. Call the show() method to display the table contents in the notebook.
      tableDataFrame.show()
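
The setJars statement in step 3 casts both class loaders to java.net.URLClassLoader, which throws a ClassCastException when a loader has another type (the system class loader on Java 9 and later, for instance). The following is a minimal sketch of a more defensive way to collect the same jar list, using only the standard library. ClasspathJars, urlsOf and merged are hypothetical names introduced for illustration; they are not part of Spark or of the notebook kernel.

```scala
import java.net.URLClassLoader

// Hypothetical helper, not part of Spark or the notebook kernel.
object ClasspathJars {
  /** URLs of a class loader as strings, or an empty Seq if it is not a URLClassLoader. */
  def urlsOf(loader: ClassLoader): Seq[String] = loader match {
    case u: URLClassLoader => u.getURLs.map(_.toString).toSeq
    case _                 => Seq.empty
  }

  /** URLs of all given loaders, in order, with duplicates dropped. */
  def merged(loaders: ClassLoader*): Seq[String] =
    loaders.flatMap(urlsOf).distinct
}
```

In the notebook, conf.setJars(ClasspathJars.merged(ClassLoader.getSystemClassLoader, kernel.interpreter.classLoader)) could then stand in for the cast-based expression, assuming kernel.interpreter.classLoader is available as in step 3. A loader that exposes no URLs simply contributes nothing, so check that the Postgres jar actually appears in the result.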


  4. You have now created a DataFrame from your Postgres table.
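
The connection settings from step 5 can also be assembled by a small helper, which makes the optional sslmode argument explicit and keeps the option keys in one place. This is only a sketch: JdbcOptions, postgresUrl and forTable are hypothetical names, and db.example.com in the usage note is a placeholder host. The option keys are the ones used in step 5.

```scala
// Hypothetical helper that builds the values written by hand in step 5.
object JdbcOptions {
  /** Builds a PostgreSQL JDBC URL; pass sslMode = None to leave the argument out. */
  def postgresUrl(host: String, port: Int, database: String,
                  sslMode: Option[String] = Some("require")): String = {
    val base = s"jdbc:postgresql://$host:$port/$database"
    sslMode.fold(base)(mode => s"$base?sslmode=$mode")
  }

  /** Options map for the "jdbc" format, with the same keys as in step 5. */
  def forTable(url: String, dbtable: String,
               user: String, password: String): Map[String, String] =
    Map(
      "url" -> url,
      "driver" -> "org.postgresql.Driver",
      "dbtable" -> dbtable,
      "user" -> user,
      "password" -> password)
}
```

For example, JdbcOptions.forTable(JdbcOptions.postgresUrl("db.example.com", 5432, "mydb"), "public.test", user, password) yields a map that can be passed to options(...) in step 7. Reading user and password from sys.env instead of typing them into the notebook is one way to keep credentials out of a shared notebook.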