Skip to main content

Hadoop and Spark Installation Guide

1. Download hadoop 3.3.0 and Spark 3.1.0.

In this guide We have used the Spark distribution without hadoop, however you should be able to use the one bundled with Hadoop.

# wget
# wget

2. Create the hadoop user

# useradd hadoop

3. Set hadoop user to be able to ssh to localhost without password

# su hadoop
$ ssh-keygen

4. < press enter to accept all defaults>

$ cat ~/.ssh/ >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost

The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:my/1wiWdA5gz/3agIXPNk4iINUUzbuFSaLXuTectG8M.
ECDSA key fingerprint is MD5:1b:b7:a3:c7:12:28:e0:98:a6:50:4b:2b:9f:8d:67:2d.
Are you sure you want to continue connecting (yes/no)? yes
$ exit
$ exit

Installing Hadoop

5. Ensure Java is installed

# java -version
openjdk version "1.8.0_282"

6. Create profile.d script to set hadoop variables for all users. Ensure that the JAVA_HOME points to your distribution of JDK.

# vi /etc/profile.d/


export JAVA_HOME





# chmod 755 /etc/profile.d/

# exit

If you are using Spark without Hadoop bundled, consider adding

export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)

7. Login again and verify variables set.

  # echo $HADOOP_HOME

8. Extract hadoop and spark downloads into /opt/hadoop and /opt/spark

Ensure that they are extracted such that the path is /opt/hadoop/etc instead of /opt/hadoop/hadoop3.3/etc

9. Add

vi /opt/spark/

#!/usr/bin/env python2

import subprocess

lines = subprocess.check_output(['/opt/nec/ve/bin/ps', 'ax']).split('\n')
ves = []
current_ve = None
for line in lines:
if line.startswith("VE Node:"):
ve_id = int(line.split(': ')[1])
current_ve = {
'id': ve_id,
'procs': []
elif line.strip().startswith("PID TTY"):
elif len(line.strip()) == 0:
parts = line.split()
proc = {
'pid': parts[0],
'tty': parts[1],
'state': parts[2],
'time': parts[3],
'command': parts[4]

ves.sort(key=lambda x: len(x['procs']) < 8)

ids = ",".join(['"' + str(x['id']) + '"' for x in ves])
print('{"name": "ve", "addresses": [' + ids + ']}')

10. Add

# vi /opt/spark/

#!/usr/bin/env bash

$SPARK_HOME/bin/spark-shell --master yarn \
--conf \
--conf \
--conf \
--conf$SPARK_HOME/ \
--conf$SPARK_HOME/ \
--files $SPARK_HOME/

11. Change ownership of /opt/hadoop /opt/spark to hadoop user.

# chown -R hadoop /opt/hadoop
# chown -R hadoop /opt/spark
# chgrp -R hadoop /opt/hadoop
# chgrp -R hadoop /opt/spark

12. Install pdsh

# yum install pdsh

13. Set hadoop configuration. Take note that if you have no GPUs installed in your system, exclude the GPU related configurations in yarn-site.xml, container-executor.cfg, and resource-types.xml.

# su hadoop
$ vi /opt/hadoop/etc/hadoop/core-site.xml


$ vi /opt/hadoop/etc/hadoop/hdfs-site.xml


$ vi /opt/hadoop/etc/hadoop/mapred-site.xml


$ vi /opt/hadoop/etc/hadoop/yarn-site.xml


$ vi /opt/hadoop/etc/hadoop/container-executor.cfg


$ vi /opt/hadoop/etc/hadoop/resource-types.xml


14. Edit the following section of capacity-scheduler.xml and change the DefaultResourceCalculator to the DominantResourceCalculator setting below:

$ vi /opt/hadoop/etc/hadoop/capacity-scheduler.xml

The ResourceCalculator implementation to be used to compare
Resources in the scheduler.
The default i.e. DefaultResourceCalculator only uses Memory while
DominantResourceCalculator uses dominant-resource to compare
multi-dimensional resources such as Memory, CPU etc.

15. Set up HDFS and YARN

$ cd /opt/hadoop
$ bin/hdfs namenode -format
$ sbin/
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hadoop
$ bin/hdfs dfs -mkdir /user/&lt;otheruser&gt;
$ bin/hdfs dfs -chown <otheruser> /user/&lt;otheruser&gt;

Repeat the mkdir and chown for <otheruser> for any other users on the system.

16. Setup GPU and Vector engine settings and scripts

$ mkdir /opt/hadoop/sbin/DevicePluginScript
$ vi /opt/hadoop/sbin/DevicePluginScript/

#!/usr/bin/env python
import os
from subprocess import Popen, PIPE

vecmd = Popen('/opt/nec/ve/bin/vecmd info', shell=True, stdout=PIPE)

lines = []
for line in vecmd.stdout:

ve_count = 0
ves = []
current_ve = None
ve_id = 0

for line in lines:
if line.startswith('Attached VEs'):
ve_count = int(line.split()[-1])

if line.startswith('[VE'):
if current_ve != None:
current_ve = {}
ve_id += 1
current_ve['id'] = ve_id
current_ve['dev'] = '/dev/ve' + str(ve_id - 1)
dev = os.lstat(current_ve['dev'])
current_ve['major'] = os.major(dev.st_rdev)
current_ve['minor'] = os.minor(dev.st_rdev)

if line.startswith('VE State'):
current_ve['state'] = line.split()[-1]

if line.startswith('Bus ID'):
current_ve['busId'] = line.split()[-1]


for ve in ves:
print("id={id}, dev={dev}, state={state}, busId={busId}, major={major}, minor={minor}".format(**ve))

17. Start YARN services

$ sbin/

18. Verify spark shell sees VE resources

$ /opt/spark/
2021-05-14 13:38:47,468 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-05-14 13:38:54,129 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://aurora06:4040
Spark context available as 'sc' (master = yarn, app id = application_1620964947383_0002).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.1

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.resources
res0: scala.collection.Map[String,org.apache.spark.resource.ResourceInformation] = Map(ve -> [name: ve, addresses: 0,1])