Thank you for your suggestion. I was able to resolve this issue using SBT assembly alone. Here are the details of how I did it.
Spark - Already present in the Cloudera VM
Scala - I am not sure whether this is present in the Cloudera VM; if not, it can be installed
SBT - This also needs to be installed. I did both installs on my local machine and transferred the jar to the VM. For installing SBT I used the following link:
https://gist.github.com/visenger/5496675
1) Once all these are installed, create a parent folder for the project. I created a folder called Twitter.
2) Create the folder structure Twitter/src/main/scala and, inside it, create the file TwitterPopularTags.scala. This has slight changes from the code we got from GitHub; I had to change the import statements to the following (a sketch of the rest of the file is shown after the imports):
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.SparkConf
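For reference, the rest of the file (below the imports) closely follows the TwitterPopularTags example from the Spark GitHub repository; the following is only a rough sketch of what mine looked like, so treat the details as approximate:
object TwitterPopularTags {
  def main(args: Array[String]) {
    // The four Twitter OAuth values come in as command-line arguments (see step 7).
    val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = args.take(4)
    System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
    System.setProperty("twitter4j.oauth.accessToken", accessToken)
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

    val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val stream = TwitterUtils.createStream(ssc, None)

    // Pull hashtags out of each tweet and count them over a sliding 60-second window.
    val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
    val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
      .map { case (topic, count) => (count, topic) }
      .transform(_.sortByKey(false))

    // Print the top 10 hashtags for each batch.
    topCounts60.foreachRDD { rdd =>
      val topList = rdd.take(10)
      println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
      topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}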
3) After this, create another folder under the parent folder:
Twitter/project
and inside it create a file named assembly.sbt. This pulls in the sbt-assembly plugin, which builds a single "fat" jar containing the dependencies. The full contents of the file are:
resolvers += Resolver.url("sbt-plugin-releases-scalasbt", url("http://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/"))

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
4) Once the above two are created, create a file named build.sbt in the parent directory of the project (Twitter). This is where we provide the name of the jar file to be created as well as the dependencies. Please note that even the blank lines between the settings in this file are important, as SBT uses them to separate the expressions.
name := "TwitterPopularTags"
version := "1.0"
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
}
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.2.0"
libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
5) Finally, open a terminal and go to the parent folder of the project (Twitter). From there, enter the following command:
sbt assembly
This will download the dependencies and create the jar file we need.
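Assuming the default project layout and Scala version (2.10 for this Spark/SBT combination), the assembled jar should end up under the target folder, at a path roughly like Twitter/target/scala-2.10/TwitterPopularTags-assembly-1.0.jar; the exact file name depends on the name, version, and Scala version used in the build.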
6) In order to run the program, we need a Twitter app created under our ID, along with its auth tokens and other details. The detailed steps for creating this are in the following link:
http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
7) Once all of the above is done, we can use the spark-submit command from the VM to run the job. An example command is:
./bin/spark-submit \
  --class TwitterPopularTags \
  --master local[4] \
  /path/to/TwitterPopularTags.jar \
  consumerkey consumersecret accesstoken accesssecret
8) This prints the output to the console, so to make it easier to monitor it is better to reduce the output frequency by adjusting the code; see the examples below.
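For example, assuming a structure similar to the sketch in step 2, two hypothetical tweaks that cut down how often results are printed are a larger batch interval or an explicit slide duration on the window:
// Hypothetical tweaks, assuming the sketch in step 2:
// 1) Use a larger batch interval so batches (and prints) happen less often.
val ssc = new StreamingContext(sparkConf, Seconds(10))   // instead of Seconds(2)
// 2) Or give the 60-second window an explicit 60-second slide, so the
//    windowed counts (and therefore the printed top-10 list) are only
//    produced once per minute.
val counts = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(60))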
Please let me know if any more details are required.
Thanks & Regards,
VPV