In this post, we configure and visualize Spark metrics (e.g., executor JVM heap usage) for a Spark application running on Amazon EMR (hereafter, EMR), using InfluxDB and Grafana, in the following steps:
- Set up InfluxDB on Amazon EC2
- Set up Grafana on Amazon EC2
- Configure Spark metrics
- Monitor the Spark application running on EMR

The following environment is used in this post:
- EMR (Release Label): emr-5.27.0
- Spark 2.4.4
- Hadoop 2.8.5
- An Amazon Linux 2 instance (ami-0a887e401f7654935) in us-east-1 hosts InfluxDB and Grafana.
- Dataset used by the Spark application: Amazon Customer Reviews Dataset
This post skips the setup of the EC2 instance itself. After logging in to the EC2 instance via SSH, install InfluxDB as follows (Ref: Installing InfluxDB OSS):
The documentation says to set name = InfluxDB Repository - RHEL \$releasever in influxdb.repo; however, this configuration fails with a 404 error on this Amazon Linux 2 instance. (Ref: repo Add support for Amazon Linux · Issue #5035 · influxdata/influxdb · GitHub)
After adding the repository, install InfluxDB:
Then start InfluxDB and check that it is running:
As in the previous section on InfluxDB, add the Grafana repository, then install and start grafana-server (Ref: Install on RPM-based Linux | Grafana Labs).
If possible, change Grafana's default username and password. After this setup, you can access grafana-server on the configured port (3000 by default), where you will see the following screen:
In this section, you configure Spark metrics on EMR over SSH. After launching the EMR cluster, log in to the master node and open /etc/spark/conf/metrics.properties. You can configure Spark metrics by adding a Sink class to this properties file. Here we push metrics from the Spark application to InfluxDB, so add the InfluxDB configuration as follows:
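As a rough illustration of what ends up in metrics.properties, the sketch below renders the sink settings from Python. The `*.sink.influx.*` key names and the `InfluxDbSink` class follow the palantir/spark-influx-sink README as we recall it, so verify them against the README for your version. The host `influxdb.example.internal` and the database `spark_metrics` are placeholders, and the helper `render_metrics_properties` is hypothetical.

```python
# Sketch: render lines for /etc/spark/conf/metrics.properties.
# Key names are assumed from the palantir/spark-influx-sink README;
# the host and database values below are placeholders.

def render_metrics_properties(host, database, port=8086, protocol="http"):
    sink = {
        "class": "org.apache.spark.metrics.sink.InfluxDbSink",
        "protocol": protocol,
        "host": host,
        "port": str(port),
        "database": database,
    }
    lines = [f"*.sink.influx.{key}={value}" for key, value in sink.items()]
    # Spark's built-in JvmSource exposes JVM heap metrics for driver and executors.
    for instance in ("driver", "executor"):
        lines.append(
            f"{instance}.source.jvm.class=org.apache.spark.metrics.source.JvmSource"
        )
    return "\n".join(lines)


properties_text = render_metrics_properties("influxdb.example.internal", "spark_metrics")
print(properties_text)
```

You can paste the printed lines at the end of metrics.properties; add authentication settings if your InfluxDB requires them.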
After configuring Spark metrics, download the jar files from the following links and place them in the home directory on the EMR master node (for this post). These jars collect Spark metrics in the Dropwizard Metrics format and push them to InfluxDB on EC2.
- Maven Repository: com.palantir.spark.influx » spark-influx-sink » 0.4.0
- Maven Repository: com.izettle » dropwizard-metrics-influxdb » 1.2.3
In this section, you write Spark code and run it with spark-submit. You then visualize the Spark metrics (here, JVM heap metrics) in Grafana, which queries InfluxDB.
This section uses the following PySpark code as an example.
Then run the code in cluster mode. When executing spark-submit, don't forget to pass the spark.executor.extraClassPath setting and the jars to the command.
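To make the required options concrete, here is a minimal sketch that assembles such a spark-submit command line in Python. The jar file names are assumed from the Maven artifact names above, `/home/hadoop` is assumed to be the home directory on EMR, and `spark_app.py` is a placeholder script name. Setting spark.driver.extraClassPath as well is our assumption for cluster mode, where the driver also runs in a YARN container.

```python
import posixpath
import shlex


def build_spark_submit_args(app_path, jar_paths):
    # Jars passed via --jars are copied into each YARN container's working
    # directory, so the class path entries are the bare file names.
    class_path = ":".join(posixpath.basename(path) for path in jar_paths)
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--jars", ",".join(jar_paths),
        "--conf", f"spark.driver.extraClassPath={class_path}",
        "--conf", f"spark.executor.extraClassPath={class_path}",
        app_path,
    ]


args = build_spark_submit_args(
    "spark_app.py",  # placeholder application script
    [
        "/home/hadoop/spark-influx-sink-0.4.0.jar",
        "/home/hadoop/dropwizard-metrics-influxdb-1.2.3.jar",
    ],
)
command_line = " ".join(shlex.quote(arg) for arg in args)
print(command_line)
```

The printed line can be run from the master node's shell, or the `args` list can be passed to `subprocess.run` directly.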
Once the Spark application is running, first check that it is pushing metrics. To do so, log in to the EC2 instance via SSH and query InfluxDB as follows.
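If you prefer to check from a script, the same kind of query can be sent to the InfluxDB 1.x HTTP `/query` endpoint. The sketch below builds the request URL and defines a small fetch helper. The database name `spark_metrics` and both helper names are assumptions for illustration, and `run_query` is only defined here, not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, database, influxql):
    # InfluxDB 1.x accepts InfluxQL through GET /query?db=<database>&q=<query>.
    params = urlencode({"db": database, "q": influxql})
    return f"{base_url.rstrip('/')}/query?{params}"


def run_query(base_url, database, influxql, timeout=10):
    # Sends the request and returns the decoded JSON response.
    with urlopen(build_query_url(base_url, database, influxql), timeout=timeout) as resp:
        return json.load(resp)


# "spark_metrics" is a placeholder database name.
url = build_query_url("http://localhost:8086", "spark_metrics", "SHOW MEASUREMENTS")
print(url)  # http://localhost:8086/query?db=spark_metrics&q=SHOW+MEASUREMENTS
```

Calling `run_query("http://localhost:8086", "spark_metrics", "SHOW MEASUREMENTS")` on the EC2 instance should list the measurements once the Spark application has started pushing metrics.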
Once you have confirmed that metrics are stored in InfluxDB, open Grafana running on the EC2 instance. Add InfluxDB as a data source (see How to setup Grafana for InfluxDB), then create a dashboard panel to visualize the Spark metrics. On the panel page, set a query that retrieves the metrics pushed to InfluxDB by the Spark application, as follows. Also add a title and description for the panel. When the configuration is done, save the panel with the "save" button at the top of the page. Finally, you can monitor the JVM heap usage of the Spark driver and executors!
After logging in to Grafana, you first need a data source. Add it through the following steps:
- Click "Configuration" (in the left sidebar)
- Click "Data Sources"
- Go to "Add data source" and then select "InfluxDB"
- Specify only "URL" and "Database" (as follows), then click "Save & Test"
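The same data source can also be registered without the UI, through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds the JSON body for that request. The data source name, the InfluxDB URL and the database name are placeholders, and `influxdb_datasource_payload` is a hypothetical helper.

```python
import json


def influxdb_datasource_payload(name, influxdb_url, database):
    # Mirrors the two fields filled in through the UI ("URL" and "Database");
    # "access": "proxy" lets the Grafana server query InfluxDB on the browser's behalf.
    return {
        "name": name,
        "type": "influxdb",
        "access": "proxy",
        "url": influxdb_url,
        "database": database,
    }


payload = influxdb_datasource_payload("InfluxDB", "http://localhost:8086", "spark_metrics")
print(json.dumps(payload, indent=2))
```

You would send this body to the Grafana server with an authenticated POST request; the UI steps above achieve the same result.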
After adding the data source, add a panel to the dashboard by clicking "+" > "Dashboard" > "Add Query" in the left sidebar.
If you want to see metrics at regular intervals, you can use a query like the following:
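As a hedged illustration of such a query, the sketch below composes an InfluxQL statement that averages a field over fixed time buckets. `$timeFilter` is the macro Grafana replaces with the dashboard's time range. The measurement name `jvm.heap.used` and the field name `value` are assumptions, so check the actual names with SHOW MEASUREMENTS and SHOW FIELD KEYS in your database.

```python
def periodic_mean_query(measurement, field="value", interval="$__interval"):
    # Averages the field per time bucket; Grafana expands $timeFilter
    # (and $__interval, when used) before sending the query to InfluxDB.
    return (
        f'SELECT mean("{field}") FROM "{measurement}" '
        f"WHERE $timeFilter GROUP BY time({interval}) fill(null)"
    )


# Hypothetical measurement name; one-minute buckets.
query = periodic_mean_query("jvm.heap.used", interval="1m")
print(query)
```

Paste the printed statement into the panel's query editor in raw query mode, and adjust the interval to the granularity you need.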
- AWS service icons are downloaded from Architecture Icons.
- Apache Spark logo is downloaded from Index of /images.
- InfluxDB logo is downloaded from Downloads / InfluxData Branding Docs.
- Grafana logo is downloaded from grafana/grafana_icon.svg at master · grafana/grafana · GitHub.
- GitHub - palantir/spark-influx-sink: A Spark metrics sink that pushes to InfluxDb