[HPE Dev] Data Analytics with PySpark using HPE Ezmeral Container Platform


*This blog is also published on the HPE Dev blog.*

This blog was selected as one of the top 5 HPE Ezmeral blogs of 2021.


PySpark is an interface for Apache Spark in Python. Apache Spark is a unified analytics engine for big data processing. It allows developers to perform data processing on files in a distributed filesystem, like the Hadoop Distributed File System (HDFS) or HPE Ezmeral Data Fabric (formerly known as MapR-XD). Because of its complexity, setting up a Spark environment has always been a pain for data scientists. Fortunately, the HPE Ezmeral Container Platform makes this much easier.

You can run Apache Spark jobs on Kubernetes-managed clusters on the HPE Ezmeral Container Platform. The platform gives you access to a wealth of MLOps tools, such as the Apache Spark Operator and a KubeDirector Jupyter Notebook where you can do your data science work and interact with the Apache Spark Operator to run your Apache Spark jobs.

Once logged in as an MLOps tenant member, you can deploy an instance of Jupyter Notebook. From the Jupyter Notebook, you can either use Apache Livy to submit Spark jobs to the Spark Operator through REST API calls, or run a Spark job directly against the Spark Operator with the PySpark module.

In this post, I will focus on running simple Spark jobs using the PySpark module on a Jupyter Notebook cluster instance deployed on the HPE Ezmeral Container Platform. For those who want to squeeze the best performance out of Spark and run Spark jobs with Apache Livy, visit this post.

Preparing the Jupyter Notebook Cluster

First, we have to prepare our favorite Jupyter Notebook environment. Inside an MLOps tenant, navigate to the Notebooks tab. You will see a Jupyter Notebook KubeDirector app prepared for you. After clicking the Launch button, you will need to configure the compute resources needed.


In the configuration screen, you must specify the name of the Jupyter Notebook cluster. Click Enable DataTap to expand access to shared data by specifying a named path to a specified storage resource.


Switching to the Notebook Endpoints tab, you can see that the access points are prepared for you. Just click the link and, after logging in with your LDAP/AD account, your favorite Jupyter environment is ready for you.


Different kernels are already installed for you. No matter which language you are using, there will be one that suits your ML project. Two of the kernels are Python related: the Python3 kernel and the PySpark kernel. The Python3 kernel lets you run single-node workloads, while the PySpark kernel, connected to the Spark Operator through Livy, lets you run distributed workloads. In this post, I am running a simple Spark job, so I will pick the Python3 kernel and import the PySpark module at runtime.


Preparing the datasets

Imagine that you have a very large CSV file ready for analysis. You need to put the file into the distributed filesystem. Of course, you can do that with the graphical user interface provided by the HPE Ezmeral Container Platform.


The other way to do this is to drag the file to the left panel of the local Jupyter Notebook cluster and run the following HDFS commands to put the file into "TenantStorage" through DataTap.

# Put a file from the local filesystem into the distributed filesystem using HDFS commands
hdfs dfs -put enhanced_sur_covid_19_eng.csv dtap://TenantStorage/enhanced_sur_covid_19_eng.csv
# List the files or directories
hdfs dfs -ls dtap://TenantStorage/
# Display the last part of the file
hdfs dfs -tail dtap://TenantStorage/enhanced_sur_covid_19_eng.csv


Getting Started with PySpark

The PySpark module is already installed for you. No extra installation is needed. So convenient, isn't it? Some configurations for the PySpark runtime are needed in order to read files from DataTap.

# Python3 kernel
from pyspark import SparkConf, SparkContext

# Specify the path of the DataTap connector jar file
conf = SparkConf().set("spark.jars", "/opt/bdfs/bluedata-dtap.jar")
sc = SparkContext(conf=conf)
# Specify the Hadoop configurations for the dtap:// filesystem
sc._jsc.hadoopConfiguration().set('fs.dtap.impl', 'com.bluedata.hadoop.bdfs.Bdfs')
sc._jsc.hadoopConfiguration().set('fs.AbstractFileSystem.dtap.impl', 'com.bluedata.hadoop.bdfs.BdAbstractFS')

Reading datasets from HPE Ezmeral Data Fabric

After some configuration, your Spark engine is connected to the platform and you can now read files from HPE Ezmeral Data Fabric through DataTap.

# Commands for reading a text file from DataTap
text = sc.textFile("dtap://TenantStorage/hello.txt")
text.take(5)
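
Since the SparkContext is now wired to DataTap, you can run distributed transformations on that file right away. Here is a minimal word-count sketch using the text RDD loaded above; hello.txt is just the sample file from this example, so substitute your own data.

# Minimal word count on the DataTap file loaded above
words = text.flatMap(lambda line: line.split())  # split each line into words
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)  # count occurrences of each word
print(counts.take(10))  # peek at the first few (word, count) pairs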

For reading CSV files as a Spark dataframe, run the following commands:

# Commands for importing PySpark SQL module.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Commands for reading a CSV file from DataTap as a dataframe
df = spark.read.csv('dtap://TenantStorage/enhanced_sur_covid_19_eng.csv', header=True, inferSchema=True)
df.take(3)
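
Note that SparkSession.builder.getOrCreate() reuses the SparkContext configured earlier, so the DataTap settings carry over. If you would rather configure everything through the SparkSession itself (for example, in a fresh kernel), a rough sketch of an equivalent setup looks like this; it reuses the same jar path and Hadoop settings shown above.

# Alternative: configure DataTap access directly when building the SparkSession
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars", "/opt/bdfs/bluedata-dtap.jar")  # DataTap connector jar
         .getOrCreate())

# Register the dtap:// filesystem implementation on the underlying SparkContext
spark.sparkContext._jsc.hadoopConfiguration().set('fs.dtap.impl', 'com.bluedata.hadoop.bdfs.Bdfs')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.AbstractFileSystem.dtap.impl', 'com.bluedata.hadoop.bdfs.BdAbstractFS')

df = spark.read.csv('dtap://TenantStorage/enhanced_sur_covid_19_eng.csv', header=True, inferSchema=True)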


Data analytics with PySpark

With PySpark, you can easily read data from files, cleanse data and do analytics within a Jupyter notebook. To view the entire notebook, click here.

In the notebook, you will find examples of the following:

Running df.printSchema() to view the schema of your dataframe.
Selecting columns of data and filtering rows according to your criteria.
Common commands to interact with your datasets.
Commands for data aggregation.
Visualizing your datasets.
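
As a rough sketch of what those notebook cells do, the snippet below strings the same steps together against the dataframe loaded above. The column names (report_date, gender, age) are hypothetical placeholders, so substitute the actual columns of your CSV; the plotting line assumes pandas and matplotlib are available in the notebook image, as they typically are in a Jupyter environment.

# Sketch of the analysis steps above; column names are hypothetical placeholders
df.printSchema()  # inspect the inferred schema

# Select a few columns and filter rows by a criterion
subset = df.select('report_date', 'gender', 'age').filter(df['age'] > 60)
subset.show(5)

# Aggregate: count rows per report date
daily_counts = df.groupBy('report_date').count().orderBy('report_date')
daily_counts.show(5)

# Visualize by converting the (small) aggregated result to pandas
pdf = daily_counts.toPandas()
pdf.plot(x='report_date', y='count', kind='line')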

Possible error

You may encounter a "permission denied" error when running the HDFS commands. To solve this, you have to "exec" into the pod and change the access mode of the core-site.xml file.


You can "exec" into it through the Jupyter notebook or using the WebTerminal that comes along with HPE Ezmeral Container Platform. To grab the Kubectl credential from HPE Ezmeral Container Platform, run the following command:

# Bash kernel
# Grab the kubectl credential from the platform gateway
kubectl hpecp refresh ez-gateway.hpeilab.com --insecure --hpecp-user=hpecli --hpecp-pass=hpecli
# List the pods to find your notebook pod
kubectl get pods --all-namespaces
kubectl get pods --namespace=poc-tenant

Run the following command to access the bash shell of the pod:

# 1: exec into the pod
kubectl exec -it <pod name> -- /bin/bash
# example
kubectl exec -it testnotebook-controller-6kq7r-0 --namespace=poc-tenant -- /bin/bash

Run the following command to change the access mode:

# 2: change the access mode for core-site.xml
chmod 666 /opt/bluedata/hadoop-2.8.5/etc/hadoop/core-site.xml

And now you can run the HDFS commands without errors.

Key takeaway

I hope this post offered you some tips on how to do big data analytics using the Apache Spark Python API, with less time spent on setting up the environment and more time spent digging out business insights from your data. Keep this notebook handy so you can refer back to it often. Also, keep an eye on the HPE DEV blog to make sure you catch future articles on this subject.