Hadoop has traditionally been a royal pain to set up and configure properly. With Cloudera’s recent distribution releases, this process has gotten simpler, but it is still a far cry from straightforward. We’ll try, if not to simplify it, at least to document it thoroughly so you can follow clear, step-by-step instructions to get your first Hadoop cluster up and running locally. Let’s dive in!
Prerequisites
This tutorial requires two hefty downloads on your workstation:
- Oracle VirtualBox, in order to run Virtual Machine Images (VMs) on your machine. Here is the link to the VirtualBox download page:
- An Ubuntu 10 Image that will house our Hadoop installation. You can grab one from here:
NOTE: as of this writing, Cloudera’s Hadoop distribution was not compatible with Ubuntu 11. Just pick the version 10.04 LTS from the downloads drop-down menu to avoid any issues with your installation.
Install VirtualBox
- Download the installation package for your operating system (Windows or Mac OS X recommended).
- Close all applications and run the installation package, following the on-screen instructions.
NOTE: The current tested version is 4.0.8 (08/05/2011).
Install Ubuntu 10 Image
- Download Ubuntu OS Version 10.04 LTS.
- Start VirtualBox from the application selection menu.
- Click the New button to create a new virtual machine, then click Continue.
- Provide a name for your VM and select Linux and Ubuntu in the OS options.
- Keep the rest of the settings at their defaults and continue with the instructions.
- Start the VM once it has been created by selecting it in the left pane and clicking the Start button.
- Select the downloaded Ubuntu installation package as the installation media.
- Proceed with default settings during the installation.
NOTE: the username hadoop is reserved and should not be chosen for your own account.
- Restart your VM OS after the installation has been completed. You should see the following screen:
Install Java JDK and Hadoop
1. Open a new terminal by going to Applications => Accessories => Terminal.
2. Check the Ubuntu release codename by running the following command:
The expected output is lucid.
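The command for this step does not appear in the text above. As a minimal sketch, and not necessarily the command originally intended, the codename can be printed with lsb_release -cs, or read from /etc/lsb-release when that tool is not installed. The helper name codename_from_file is hypothetical and used only for illustration.

```shell
# Hypothetical helper: read the codename from an lsb-release style file.
codename_from_file() {
  sed -n 's/^DISTRIB_CODENAME=//p' "$1"
}

# Prefer lsb_release when it is installed; otherwise fall back to the file.
if command -v lsb_release >/dev/null 2>&1; then
  lsb_release -cs
elif [ -r /etc/lsb-release ]; then
  codename_from_file /etc/lsb-release
fi
```

On an Ubuntu 10.04 LTS machine either path should print lucid, which is the codename used in the repository lines in the next steps.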
3. Inside the terminal, create the file /etc/apt/sources.list.d/cloudera.list by running the following command:
sudo vi /etc/apt/sources.list.d/cloudera.list
4. Paste the following two lines into the file:
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
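If you would rather not edit the file in vi, steps 3 and 4 can be sketched as a single non-interactive command. This is only an alternative under the assumption that the two lines above are the complete file contents; the function name cloudera_sources is hypothetical.

```shell
# Hypothetical helper: print the two Cloudera repository lines
# for a given Ubuntu codename (for example "lucid").
cloudera_sources() {
  codename="$1"
  printf 'deb http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$codename"
  printf 'deb-src http://archive.cloudera.com/debian %s-cdh3 contrib\n' "$codename"
}

# Preview the lines that would be written.
cloudera_sources lucid

# On the VM, write them to the sources file (needs root):
# cloudera_sources lucid | sudo tee /etc/apt/sources.list.d/cloudera.list
```

The write is left commented out so the preview can be checked first; sudo tee is used because a plain shell redirect would run without root privileges.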
5. Run the following commands in the terminal window:
sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
sudo apt-get update
sudo apt-get install sun-java6-jdk
sudo apt-get install hadoop-0.20
6. Install Hadoop components:
sudo apt-get install hadoop-0.20-namenode
sudo apt-get install hadoop-0.20-datanode
sudo apt-get install hadoop-0.20-jobtracker
sudo apt-get install hadoop-0.20-tasktracker
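The four component installs follow one naming pattern, so they could also be collapsed into a single apt-get call. The sketch below only builds and prints that command as a dry run; the helper name hadoop_packages is hypothetical.

```shell
# Hypothetical helper: list the four daemon packages installed in this step.
hadoop_packages() {
  for role in namenode datanode jobtracker tasktracker; do
    printf 'hadoop-0.20-%s\n' "$role"
  done
}

# Dry run: show the combined command without executing it.
echo "sudo apt-get install $(hadoop_packages | tr '\n' ' ')"

# On the VM, run it for real:
# sudo apt-get install $(hadoop_packages)
```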
7. Install the configuration for a pseudo-distributed cluster:
sudo apt-get install hadoop-0.20-conf-pseudo
8. Start services by running the following command in the terminal window:
for x in /etc/init.d/hadoop-* ; do sudo $x start; done
9. Check your installation by opening the following links in your web browser (the NameNode and JobTracker web interfaces, respectively):
http://localhost:50070
http://localhost:50030
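The same check can be sketched from the terminal, which is handy if the daemons need a few seconds to come up. This assumes wget is available on the VM; the function name check_ui is hypothetical, and a missing wget is reported as "down" as well.

```shell
# Hypothetical helper: report whether a web interface answers an HTTP request.
check_ui() {
  url="$1"
  # -q quiet, -T 5 second timeout, -t 1 single attempt, discard the page body.
  if wget -q -T 5 -t 1 -O /dev/null "$url" 2>/dev/null; then
    echo "up: $url"
  else
    echo "down: $url"
  fi
}

# Probe the two web interfaces from step 9.
for url in http://localhost:50070 http://localhost:50030; do
  check_ui "$url"
done
```

If either line reports "down", give the services a moment and try again, or rerun the start loop from step 8.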