Setting Up the Apache Spark Environment: Standalone Mode

1. Server (Node) & Software Setup for Apache Spark and Hive Metastore

Powering up Apache Spark Cluster for High-Scale Data Transformations

Master

  • VM for Spark Master — 128 Cores / 256 GB
  • Configure Servers with Ubuntu Linux OS (Ubuntu v22.04.5)

Workers

  • 8 x Dell R740 — 96 Cores / 256 GB
  • 8 x Dell R6515 — 128 Cores / 256 GB
  • Configure Servers with Ubuntu Linux OS (Ubuntu v22.04.5)

Object Storage

  • Prepare the servers to run Apache Spark and integrate them with a suitable object storage service for the data platform

2. Steps to Download, Install and Configure Apache Spark in Standalone mode

  • Install Java Development Kit (JDK) version 8 or above
  • Download latest stable release (e.g., Spark 3.5.2) from the Apache Spark download page
  • Set Environment Variables & reload the profile
  • Edit .bashrc or /etc/profile to include Spark-related environment variables
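As a minimal sketch of the environment-variable step (the install paths below are assumptions; adjust them to wherever you unpacked the JDK and Spark), the entries in .bashrc might look like:

```shell
# Assumed install locations -- adjust to your own layout
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

After editing, reload the profile (e.g., `source ~/.bashrc`) so the variables take effect in the current shell.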

Then,

  • Start the Master Node
$SPARK_HOME/sbin/start-master.sh

Start a Worker Node — By default, workers connect to the master at spark://<hostname>:7077

$SPARK_HOME/sbin/start-worker.sh spark://<master-hostname>:7077

Access the Web UI — Monitor the Spark cluster from the web UI at http://<master-IP>:8080
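Once the master and at least one worker are up, a quick smoke test can confirm the cluster accepts jobs. This is a sketch under two assumptions: the master runs on the local host on the default port 7077, and Spark's bundled SparkPi example jar is present under $SPARK_HOME/examples:

```shell
# Build the master URL from this host's name (assumes a local master on port 7077)
MASTER_URL="spark://$(hostname):7077"

# Submit the bundled SparkPi example if Spark is installed (skipped otherwise)
if [ -x "$SPARK_HOME/bin/spark-submit" ]; then
  "$SPARK_HOME/bin/spark-submit" \
    --master "$MASTER_URL" \
    --class org.apache.spark.examples.SparkPi \
    "$SPARK_HOME"/examples/jars/spark-examples_*.jar 10
fi
```

A completed SparkPi run should also show up as a finished application in the master's web UI on port 8080.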

3. Steps to Download, Install and Configure Hive Metastore with PostgreSQL

The Hive Metastore is required to store table metadata, and PostgreSQL can be used as its backend database to ensure durability and reliability.

  • Download Hive from the Apache Hive download page
  • Edit .bashrc or /etc/profile to include Hive-related environment variables
  • Configure Hive to Use PostgreSQL — Add Postgres Connector to Hive
  • Download the Postgres Connector JAR and place it in the Hive lib directory
  • Copy hive-default.xml.template to hive-site.xml in $HIVE_HOME/conf
  • Configure the hive-site.xml file
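A minimal hive-site.xml for a PostgreSQL-backed metastore might contain the properties below. This is a sketch, not a full configuration: the JDBC URL assumes PostgreSQL runs locally on the default port 5432, and the database name, user, and password match the ones created in the PostgreSQL step.

```xml
<!-- $HIVE_HOME/conf/hive-site.xml (sketch; values assume a local PostgreSQL install) -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:5432/hivemetastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```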
  • Install PostgreSQL & start the service
wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xzf apache-hive-3.1.3-bin.tar.gz
mv apache-hive-3.1.3-bin /usr/local/hive
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
source ~/.bashrc
wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.0/postgresql-42.7.0.jar
cp postgresql-42.7.0.jar $HIVE_HOME/lib/
cp postgresql-42.7.0.jar $SPARK_HOME/jars
sudo apt update
sudo apt install postgresql -y
sudo systemctl start postgresql.service
  • Create a database for the Hive Metastore and a user with appropriate permissions:
su - postgres
psql
CREATE DATABASE hivemetastore;
CREATE USER hiveuser WITH PASSWORD 'hivepassword';
GRANT ALL PRIVILEGES ON DATABASE hivemetastore TO hiveuser;
\q
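As an optional sanity check before wiring Hive to the database, confirm the new role can actually connect (hostname and port below assume a default local install):

```shell
# Connection details created above (host/port are assumed defaults)
HIVE_DB=hivemetastore
HIVE_DB_USER=hiveuser

# Only attempt the check when a local PostgreSQL server is actually up
if command -v pg_isready >/dev/null 2>&1 && pg_isready -q -h localhost -p 5432; then
  PGPASSWORD=hivepassword psql -h localhost -p 5432 -U "$HIVE_DB_USER" -d "$HIVE_DB" -c 'SELECT 1;'
fi
```

A `SELECT 1` returning one row confirms the credentials and grants are in place.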

4. Download, Install and Configure Hadoop

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xzf hadoop-3.3.4.tar.gz
mv hadoop-3.3.4 /usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
source ~/.bashrc
  • Configure hadoop-env.sh (in $HADOOP_HOME/etc/hadoop) to access object storage
  • Edit core-site.xml and update variables such as the bucket name, S3 access key & secret key, and endpoint
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
hadoop fs -ls s3a://<bucket-name>/
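The core-site.xml entries mentioned above can be sketched as follows. The property names are the standard hadoop-aws (S3A) ones; the key, secret, and endpoint values are placeholders to be replaced with your object store's details:

```xml
<!-- $HADOOP_HOME/etc/hadoop/core-site.xml (sketch; replace the placeholder values) -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://s3.example.com</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```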
  • Initialize the Hive Metastore (HMS) Schema
cd $HIVE_HOME/bin
$HIVE_HOME/bin/schematool -dbType postgres -initSchema

su - postgres
psql
postgres=# \c hivemetastore
#You are now connected to database "hivemetastore" as user "postgres".

hivemetastore=# GRANT CREATE ON SCHEMA public TO hiveuser;
#GRANT

hivemetastore=# \q

5. Connect Apache Spark to Hive Metastore

  • Copy the Hive configuration file (hive-site.xml) to the Spark configuration directory ($SPARK_HOME/conf)
  • Start Spark Shell with Hive Support
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
spark-shell --conf spark.sql.catalogImplementation=hive

6. Validate the Installation

  • Start HiveServer2 to interact with Hive through JDBC or Beeline
  • Create a test table in the Hive Metastore
  • Open a Spark shell and run the following commands to query the Hive table:
$HIVE_HOME/bin/hive --service metastore &
$HIVE_HOME/bin/hive --service hiveserver2 &
$HIVE_HOME/bin/beeline -u jdbc:hive2://

At the Beeline prompt, create the test table:

CREATE TABLE test_table (id INT, name STRING);

Then, in the Spark shell, query it:

spark.sql("SHOW TABLES").show()
spark.sql("SELECT * FROM test_table").show()

test_table is created in the Hive Metastore (HMS), and this can be verified from Spark using Spark SQL DESCRIBE commands.
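That verification can be sketched as a one-liner against the Spark SQL CLI, assuming hive-site.xml has already been copied into $SPARK_HOME/conf as in step 5:

```shell
# Hypothetical table name from the validation step above
TABLE_NAME=test_table

# Describe the table through Spark's catalog (skipped when Spark is not installed)
if [ -x "$SPARK_HOME/bin/spark-sql" ]; then
  "$SPARK_HOME/bin/spark-sql" -e "DESCRIBE FORMATTED $TABLE_NAME;"
fi
```

If the table was registered correctly, the output includes the PostgreSQL-backed metastore's view of its columns, location, and provider.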