[HPE Dev] Getting Started with DataTaps in Kubernetes Pods


This blog post is also published on the HPE Dev blog.

What is DataTap?

Handling the different protocols of different file systems is a constant pain for data analysts. DataTap is a file system connector that aims to alleviate this pain. DataTap provides an HDFS protocol abstraction that allows big data applications like Spark to run unmodified, with fast access to data sources other than HDFS, e.g. HPE Ezmeral Data Fabric XD (formerly named MapR-FS/XD) and GCS (Google Cloud Storage). Using DataTap, you can keep your code unchanged while the underlying data source is swapped, for example from HDFS to MapR-FS. This flexibility allows developers like you to focus more on coding than on the underlying infrastructure. More information on DataTap can be found here.

In this blog, I will introduce two ways to access DataTaps in Kubernetes clusters managed by HPE Ezmeral Container Platform deployed with a pre-integrated HPE Ezmeral Data Fabric. The first method covers how to access DataTaps using HDFS commands; the second focuses on reading data directly from Apache Spark (using pyspark). Here we go!

Enable DataTap when creating KubeDirector App

First and foremost, you have to enable DataTaps while creating a KubeDirector app. This can be done by ticking the "Enable DataTap" box.

image

This mounts a number of files under /opt/bdfs/ in your pod. If you can see these files in your pod (as shown in the image below), your pod is DataTap enabled and you are ready to access files through DataTap.

image
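A quick way to confirm this from a shell inside the pod is to list the mount point. A minimal check (the exact file list may vary between platform versions):

# bash (inside the pod)

# List the DataTap client files mounted by the platform.
ls -l /opt/bdfs/

You should at least see bluedata-dtap.jar, which is used throughout the rest of this post.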

The generic approach can be summarized into these two steps:

  1. Add /opt/bdfs/bluedata-dtap.jar to the classpath.
  2. Configure Hadoop with the following values.
name                             value
fs.dtap.impl                     com.bluedata.hadoop.bdfs.Bdfs
fs.AbstractFileSystem.dtap.impl  com.bluedata.hadoop.bdfs.BdAbstractFS
fs.dtap.impl.disable.cache       false

Note: fs.dtap.impl.disable.cache is optional.
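As a quick illustration of these two steps, the same values can also be passed ad hoc with Hadoop's generic -D options instead of being written into core-site.xml. A minimal sketch, assuming a working Hadoop client is already available in the pod (installation and configuration are covered in the next section):

# bash

# Step 1: add the DataTap client jar to the classpath.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/bdfs/bluedata-dtap.jar

# Step 2: pass the Hadoop values on the command line instead of core-site.xml.
hdfs dfs \
  -D fs.dtap.impl=com.bluedata.hadoop.bdfs.Bdfs \
  -D fs.AbstractFileSystem.dtap.impl=com.bluedata.hadoop.bdfs.BdAbstractFS \
  -ls dtap://TenantStorage/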

Reference: Accessing DataTaps in Kubernetes Pods

Uniform Resource Identifier

In HPE Ezmeral Container Platform, you can see the different types of file systems used by the shared storage resources. You can manage these different data sources through a GUI while addressing files with the same URI scheme. The URI has the format:

dtap://datatap_name/some_subdirectory/another_subdirectory/some_file
Screenshot   Description
image        You can manage different data sources whether they are in the MapR file system or HDFS.
image        You can add a new DataTap from this screen.
image        You can upload, delete, or rename files using the GUI.

Access DataTaps using HDFS commands

Introduction

The Hadoop Distributed File System (HDFS) is a key component of the Hadoop ecosystem, and the HDFS shell commands are what you use to manipulate files on it.

To use the HDFS commands against a DataTap, you first need to set up Hadoop using the following steps:

Prepare Hadoop

Some of the KubeDirector apps provided by HPE come with a well-configured Hadoop pre-installed. If that is the case for your app, the following installation steps can be skipped.

Install OpenJDK and the dependencies

apt update && apt upgrade -y
apt install wget -y

# install openjdk
DEBIAN_FRONTEND=noninteractive apt-get install openjdk-11-jdk-headless -y

Download and untar Hadoop

You can always find the latest version of Hadoop on Apache Hadoop Releases.

wget https://apache.website-solution.net/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz   # Download Hadoop binary
tar zxf hadoop-*.tar.gz                                                                   # Untar Hadoop binary
mv hadoop-3.3.0 $HOME/hadoop                                                              # Rename and move Hadoop folder to $HOME
cd $HOME/hadoop                                                                           # Move directory to hadoop

Configure the required environment

In $HADOOP_HOME/etc/hadoop/hadoop-env.sh file, assign the following environment variables ($JAVA_HOME, $HADOOP_HOME, $HADOOP_CLASSPATH):

# These two variables are needed for the HDFS commands. Located at lines 54, 58.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
export HADOOP_HOME=$HOME/hadoop

# This variable is DataTap specific. Located at line 126.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/lib/:/opt/bdfs/bluedata-dtap.jar
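To double check that the DataTap jar was picked up, you can print the resulting classpath. A minimal check:

# bash: Current working directory -> $HADOOP_HOME

# The output should contain /opt/bdfs/bluedata-dtap.jar.
bin/hadoop classpath | tr ':' '\n' | grep bluedata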

In $HADOOP_HOME/etc/hadoop/core-site.xml file, configure Hadoop with the following values:

<configuration>
  <property>
    <name>fs.dtap.impl</name>
    <value>com.bluedata.hadoop.bdfs.Bdfs</value>
  </property>

  <property>
    <name>fs.AbstractFileSystem.dtap.impl</name>
    <value>com.bluedata.hadoop.bdfs.BdAbstractFS</value>
  </property>

  <property>
    <name>fs.dtap.impl.disable.cache</name>
    <value>false</value>
  </property>
</configuration>

Alternative

I have prepared an example configuration file on GitHub. If your Hadoop does not need any special configuration, you can simply download it and replace your existing configuration file.

Test your HDFS command

Here are some common commands used to interact with DataTap.

# bash: Current working directory -> $HADOOP_HOME

# Check the Hadoop version.
bin/hadoop version

# List the files from the default TenantStorage Data Source.
bin/hdfs dfs -ls dtap://TenantStorage/

# Make new directory user in dtap://TenantStorage/.
bin/hdfs dfs -mkdir dtap://TenantStorage/user

# Put the local text file helloworld.txt into the "cenz" folder.
bin/hdfs dfs -put helloworld.txt dtap://TenantStorage/cenz
bin/hdfs dfs -put -f helloworld.txt dtap://TenantStorage/cenz # force replacement

# Print the contents of a file in DataTap.
bin/hdfs dfs -cat dtap://TenantStorage/cenz/helloworld.txt

# Remove a file in DataTap.
bin/hdfs dfs -rm dtap://TenantStorage/cenz/helloworld.txt

Tip:

To avoid having to prefix every command with bin/, we can add Hadoop's bin and sbin directories to $PATH:

export HADOOP_HOME=$HOME/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
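After reloading your shell (or sourcing the file where you added these lines), the same commands work from any directory:

# bash: no bin/ prefix needed anymore
hdfs dfs -ls dtap://TenantStorage/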

Reference: Hadoop File System Shell Document

Access DataTaps using pyspark

Introduction

PySpark is an interface for Apache Spark in Python. Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. Apache Spark can access data from HDFS and, with the extension, file systems managed by DataTap.

Install pyspark

There are many ways to install Spark. The simplest is to install the pyspark package directly with pip install pyspark. Run the following to install the prerequisite packages and pyspark.

# install pyspark & Java
apt-get install python3 -y
apt-get install python3-pip -y
DEBIAN_FRONTEND=noninteractive apt-get install openjdk-11-jdk-headless -y
pip install pyspark
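A quick sanity check that the installation worked:

# bash: confirm that pyspark is importable and print its version
python3 -c "import pyspark; print(pyspark.__version__)"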

There are two ways to interact with pyspark. The first is to execute the pyspark command in bash to start an interactive pyspark session. The second is to treat pyspark as a module that the Python interpreter imports (import pyspark).

Method one

Initiate pyspark session with jars

In order to use DataTap with pyspark, you have to pass the DataTap client jar file as an argument to pyspark. Start Spark's interactive Python shell using the following command.

# bash

# Specify the path of the DataTap jar file
pyspark --jars /opt/bdfs/bluedata-dtap.jar

After starting the interactive shell, a SparkContext and a SparkSession are automatically created for you.

image

After setting the Hadoop configuration values, you can read files from DataTap just as you normally would with HDFS.

# pyspark

# Specify the Hadoop configurations.
sc._jsc.hadoopConfiguration().set('fs.dtap.impl', 'com.bluedata.hadoop.bdfs.Bdfs')
sc._jsc.hadoopConfiguration().set('fs.AbstractFileSystem.dtap.impl', 'com.bluedata.hadoop.bdfs.BdAbstractFS')

# Commands for reading DataTap file
text = sc.textFile("dtap://TenantStorage/HPE.txt")
text.take(5)

image

Method two

Start Python and initialize pyspark with the jar at runtime

Run the Python Shell first:

# bash
python3

At the Python prompt, add the path of the jar file using SparkConf:

# python
from pyspark import SparkConf, SparkContext

# Specify the path of the DataTap jar file.
conf = SparkConf().set("spark.jars", "/opt/bdfs/bluedata-dtap.jar")
sc = SparkContext(conf=conf)

# Specify the Hadoop configurations.
sc._jsc.hadoopConfiguration().set('fs.dtap.impl', 'com.bluedata.hadoop.bdfs.Bdfs')
sc._jsc.hadoopConfiguration().set('fs.AbstractFileSystem.dtap.impl', 'com.bluedata.hadoop.bdfs.BdAbstractFS')

# Commands for reading DataTap file
text = sc.textFile("dtap://TenantStorage/HPE.txt")
text.take(5)
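Beyond these two interactive methods, the same settings can be supplied to a non-interactive job through spark-submit. A minimal sketch, assuming the snippet above is saved as a script named read_dtap.py (the script name is just an example):

# bash

# Pass the DataTap jar and the Hadoop values on the command line.
# The spark.hadoop.* prefix forwards each value into the Hadoop configuration.
spark-submit \
  --jars /opt/bdfs/bluedata-dtap.jar \
  --conf spark.hadoop.fs.dtap.impl=com.bluedata.hadoop.bdfs.Bdfs \
  --conf spark.hadoop.fs.AbstractFileSystem.dtap.impl=com.bluedata.hadoop.bdfs.BdAbstractFS \
  read_dtap.py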


Conclusion

A distributed file system is fundamental for handling large amounts of data, but managing those file systems is often a pain for developers. DataTaps unify different storage resources under a single path scheme that different clusters can use, which saves you from time-consuming copies and transfers of data. More time spent extracting business insight from your data and less time handling tedious plumbing - that's what DataTap gives you. And that's what you get with HPE Ezmeral.