In this section, we'll build and run Scala-Spark/PySpark applications using the chapter2 examples in the databricks/LearningSparkV2 repository on GitHub.
In this part, we'll develop a Scala-Spark application with sbt (for installing sbt on your EC2 instance, please refer to https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Linux.html). Before building the application, note the supported Scala version in the Spark documentation:
Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.5+. Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0. Python 2 and Python 3 prior to version 3.6 support is deprecated as of Spark 3.0.0. For the Scala API, Spark 3.0.1 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).
Then, we clone the repository.
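The clone command looks like this (the repository URL comes from the databricks/LearningSparkV2 project referenced above):

```shell
# Clone the book's example repository from GitHub
git clone https://github.com/databricks/LearningSparkV2.git
```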
After cloning the repo, we put the sample dataset (./LearningSparkV2/chapter2/scala/data/mnm_dataset.csv) on HDFS as follows:
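A minimal sketch of the upload; the HDFS target directory (/user/tomtan/data) is an assumption here, so adjust it to your environment:

```shell
# Create a target directory on HDFS (path is a hypothetical example)
hdfs dfs -mkdir -p /user/tomtan/data
# Upload the sample dataset from the cloned repository
hdfs dfs -put ./LearningSparkV2/chapter2/scala/data/mnm_dataset.csv /user/tomtan/data/
```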
Before building the application, we update build.sbt for compatibility with our cluster environment.
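A minimal build.sbt sketch; the Spark and Scala version numbers below are assumptions based on the documentation quoted above (Spark 3.0.1 on Scala 2.12), so match them to your cluster:

```scala
// build.sbt -- a sketch; versions are assumptions, adjust to your cluster
name := "main-scala-chapter2"
version := "1.0"
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // "provided" because the cluster supplies the Spark runtime at submit time
  "org.apache.spark" %% "spark-core" % "3.0.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.0.1" % "provided"
)
```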
Then, let's build the package with sbt.
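The standard sbt build command, run from the project directory containing build.sbt:

```shell
# Compile the sources and produce a jar under target/scala-2.12/
sbt package
```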
Finally, let's run the application via spark-submit.
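A hedged sketch of the submit command; the jar name, the main class, and the HDFS input path are assumptions based on the book's chapter2 layout, so adjust them to match your sbt output and your upload location:

```shell
# Submit the jar to YARN; class/jar/path below are illustrative assumptions
spark-submit \
  --class main.scala.chapter2.MnMcount \
  --master yarn \
  target/scala-2.12/main-scala-chapter2_2.12-1.0.jar \
  hdfs:///user/tomtan/data/mnm_dataset.csv
```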
If it runs successfully, you will be able to see the following results with yarn logs:
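The logs can be fetched with the application ID that YARN reports when the job is submitted:

```shell
# Replace <application_id> with the ID printed at submission time
yarn logs -applicationId <application_id>
```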
As with the first application, we can run the same job with PySpark. The default Python version on the EC2 instance is Python 2.7.18. In general, Python 3 should be used with current PySpark; however, the sample code is also compatible with Python 2, so we'll use it without any changes. If you want to use Python 3 on your machine, please update the Python version accordingly.
Let's run mnmcount.py as follows.
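A sketch of the PySpark submission; the script location within the repo and the HDFS input path are assumptions, so point them at your checkout and the directory where you uploaded mnm_dataset.csv:

```shell
# Submit the Python version of the job to YARN (paths are assumptions)
spark-submit --master yarn \
  ./LearningSparkV2/chapter2/py/src/mnmcount.py \
  hdfs:///user/tomtan/data/mnm_dataset.csv
```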
Then, you can get the same result.
In this section, we'll create a stream processing application with Structured Streaming. Structured Streaming is one of the components of Apache Spark, built on top of the Spark SQL abstraction as shown in the following diagram. We won't go into the details of each Spark component in this session. If you want to learn about the streaming components, the following resources should be very helpful:
- Stream Processing with Apache Spark Book
- Spark: The Definitive Guide Book
- Learning Spark, 2nd Edition Book
We'll use the CSV file source for the streaming application in this section. The CSV file source needs a schema (you cannot use the
inferSchema option). Therefore, we first put a partial sample of the data we used in the previous section on HDFS. In the streaming application, we'll extract the schema from this sample with a batch DataFrame.
Here's the streaming application code. In this case, you can simply run the code through the spark-shell console. The first snippet gets the file schema with a batch DataFrame using the
inferSchema option. Using this schema, we then process the data that is put on HDFS.
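The steps above can be sketched in Scala as follows; it is a sketch to run in the spark-shell console (where `spark` is predefined), and the HDFS paths and the column names used in the aggregation are assumptions, so adapt them to your data and directories:

```scala
// Sketch for spark-shell; paths and column names below are assumptions.

// 1) Infer the schema from the static sample with a batch DataFrame.
val sampleDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/user/tomtan/sample/mnm_sample.csv")
val schema = sampleDF.schema

// 2) Create a streaming DataFrame over the watched directory,
//    reusing the inferred schema (the CSV source requires one).
val streamDF = spark.readStream
  .option("header", "true")
  .schema(schema)
  .csv("/user/tomtan/streaming/")

// 3) Aggregate like the batch mnmcount example and print to the console.
val countsDF = streamDF
  .groupBy("State", "Color")
  .count()

val query = countsDF.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```

The `complete` output mode is used because the query contains an aggregation; each trigger prints the full updated result table to the console.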
After running the above code, we put
mnm_dataset.csv into the
/user/tomtan/streaming directory on HDFS. Then the streaming application processes the data, and you can get the following result.
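The upload that triggers the stream looks like this, using the dataset path from the cloned repository and the /user/tomtan/streaming directory mentioned above:

```shell
# Dropping a new file into the watched directory triggers processing
hdfs dfs -put ./LearningSparkV2/chapter2/scala/data/mnm_dataset.csv /user/tomtan/streaming/
```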