This experiment was done to gain a first understanding of how to deal with big data on the cloud, so it was my first hands-on experience with several new technologies such as Hadoop and Hive. Before the experiment I tried a few examples with Hadoop and related technologies, and found that this would be a better way to go.
This experiment is based on a dataset of annual salaries, in this case for San Francisco, California, USA in 2015. The dataset contains names, job titles, base salary, additional benefits and total salary. I'll concentrate on job titles and their averaged total salaries.
Please note that this experiment depends solely on the dataset; actual survey reports may contain values different from the final results.
Before moving on to the rest of the article, you are expected to have a good understanding of the following:
- Familiarity with the Microsoft Azure platform. You can create a free Azure account and play around with the portal. You also need to understand concepts such as Resources and Resource Groups.
- A basic understanding of big data concepts: what big data is, the Hadoop Distributed File System (HDFS), the Hadoop ecosystem, Hive, SQL etc.
- Preferably, some completed tutorials from the Azure documentation, especially "Analyze flight delay data", which laid the foundation for this article.
- An understanding of HTML, PHP and JavaScript with AJAX, required to follow the visualization part.
- Familiarity with tools like PuTTY and WinSCP if you are on Windows, or with the scp and ssh commands otherwise.
The deployment required for this article can be shown as follows.
Please note that I shut down the HDInsight cluster once it has completed its job; otherwise you'll lose your credits!
The diagram shows the different steps needed in this experiment.
- First you need a dataset to examine.
- Then you have to transfer it to the Hadoop cluster, and from there to the Hadoop Distributed File System (HDFS).
- Thereafter you run the required Hive queries to extract the essence of the dataset. After processing, we'll move the data to a SQL database for ease of access.
- Finally you develop a PHP application, running on an App Service node, to visualize the results.
You can download the San Francisco annual income dataset from the following location:
After downloading, open the file using Excel. I observed that several job titles contain commas, so use find & replace to replace those commas with hyphens. Since this is a CSV file, stray commas inside field values would negatively affect our analysis.
Put the CSV inside a zip archive to minimize the data transfer.
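If you prefer to script the cleanup instead of doing it by hand in Excel, the following sketch does the same job. All file names here are hypothetical placeholders (it even creates a tiny two-line sample CSV so it runs end-to-end); point it at the real downloaded file instead.

```shell
# Hypothetical file names; substitute the real downloaded CSV.
IN=salaries.csv
OUT=salaries_clean.csv

# Sample input so the sketch is self-contained; skip this step with real data.
printf '%s\n' 'EmployeeName,JobTitle,TotalPay' \
              'Jane Doe,"Manager, IT",120000' > "$IN"

# CSV-aware pass: replace commas inside field values with hyphens.
python3 - "$IN" "$OUT" <<'EOF'
import csv, sys
src, dst = sys.argv[1], sys.argv[2]
with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        writer.writerow([field.replace(',', '-') for field in row])
EOF

# Zip the cleaned file to minimize the transfer (stdlib, no zip tool needed).
python3 -c "import zipfile; zipfile.ZipFile('salaries.zip', 'w', zipfile.ZIP_DEFLATED).write('$OUT')"
```

After this, every field is comma-free, so a plain split on commas downstream (as Hive's delimited format does) yields exactly one value per column.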
Setting up Azure resources
Now you need to create the required Azure resources. Let's start with HDInsight. In this case we need to create a Hadoop on Linux cluster with 2 worker nodes. The following guide will help you create such a cluster quickly.
Also, for further information about cluster types and concepts, you may look at the following link:
Unfortunately, Azure still has no way to deactivate an HDInsight cluster when it is idle. You need to delete it manually; otherwise you'll be charged for idle hours too.
Thereafter you need to create a SQL database. The following tutorial will help with that:
https://azure.microsoft.com/en-us/documentation/articles/sql-database-get-started/#create-a-new-azure-sql-database (Create a new Azure SQL database section)
Finally you need to create an empty App Service. For further information about App Service, you may refer to the following:
This App Service provides the PHP runtime that will be needed in the last part of the article.
A best practice when creating the above resources is to allocate them all to a single resource group, which makes them easy to manage.
Also make sure to use strong passwords, and remember them.
First you need to transfer the zip file you created to the Hadoop node's file system. You can do this using the scp command or any GUI tool that supports scp.
As the host, specify "CLUSTERNAME-ssh.azurehdinsight.net", and provide your SSH credentials along with it.
scp FILENAME.zip USERNAME@CLUSTERNAME-ssh.azurehdinsight.net:
Then you need to access that node using SSH. On Windows you can use the PuTTY tool; others can use a terminal:
ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.net
Then unzip the uploaded file:
unzip FILENAME.zip
Next, you have to move the CSV file to the Hadoop Distributed File System. Use the following commands to create a new directory in HDFS and copy the CSV file into it:
hdfs dfs -mkdir -p /sf/salary/data
hdfs dfs -put FILENAME.csv /sf/salary/data
Now you need to create and execute a Hive query. You can execute Hive queries from a file or interactively. We'll do both, but first using a file.
Create the following “salaries.hql” using nano
and add the following queries:
DROP TABLE salaries_raw;

-- Creates an external table over the csv file
CREATE EXTERNAL TABLE salaries_raw (
    EMPLOYEE_NAME string,
    JOB_TITLE string,
    BASE_PAY float,
    OVERTIME_PAY float,
    OTHER_PAY float,
    BENEFITS float,
    TOTAL_PAY float,
    TOTAL_PAY_N_BENEFITS float,
    YEAR string,
    NOTES string,
    AGENCY string,
    STATUS string)
-- The following lines describe the format and location of the file
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/sf/salary/data';

-- Drop the salaries table if it exists
DROP TABLE salaries;

-- Create the salaries table and populate it with data
-- pulled in from the CSV file (via the external table defined previously)
CREATE TABLE salaries AS
SELECT JOB_TITLE AS job_title,
       BASE_PAY AS base_pay,
       TOTAL_PAY AS total_pay,
       TOTAL_PAY_N_BENEFITS AS total_pay_n_benefits
FROM salaries_raw;
You can also create "salaries.hql" locally and upload it via SCP.
The queries are mostly self-explanatory; note that each query ends with a semicolon. The "salaries_raw" table is created to extract values directly from the CSV, so its columns have a one-to-one mapping with the CSV fields, and its data is read from the HDFS location where we stored the file. The "salaries" table is then created from "salaries_raw", keeping only the job_title, base_pay, total_pay and total_pay_n_benefits columns, because only those are necessary for the next stage.
To execute the Hive query, use the following command:
beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -n admin -f salaries.hql
The next part of the Hive processing we'll do interactively. You can open an interactive shell with the command:
beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -n admin
and enter the following statement:
INSERT OVERWRITE DIRECTORY '/sf/salary/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT job_title, avg(base_pay), avg(total_pay), avg(total_pay_n_benefits)
FROM salaries
WHERE base_pay IS NOT NULL
  AND total_pay IS NOT NULL
  AND total_pay_n_benefits IS NOT NULL
GROUP BY job_title;
The above Hive query writes its result to the "/sf/salary/output" folder. It groups rows by job title and computes the averages of the base_pay, total_pay and total_pay_n_benefits columns.
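To make the grouping concrete, here is a small local sketch that mirrors the GROUP BY job_title + avg(...) logic of the query. It is illustration only, using a few made-up rows rather than the real dataset, and it writes tab-separated lines similar to the files Hive produces.

```shell
# Local illustration only (made-up rows, not the real dataset): mirrors
# the GROUP BY job_title + avg(...) logic of the Hive query above.
python3 - <<'EOF' > group_averages.tsv
from collections import defaultdict

# (job_title, base_pay, total_pay, total_pay_n_benefits) sample rows
rows = [
    ("Nurse", 100000.0, 120000.0, 150000.0),
    ("Nurse", 110000.0, 130000.0, 160000.0),
    ("Clerk",  50000.0,  55000.0,  70000.0),
]

groups = defaultdict(list)
for title, base, total, total_nb in rows:
    groups[title].append((base, total, total_nb))

# One tab-separated line per job title, like the output files Hive writes
for title, vals in sorted(groups.items()):
    avgs = [sum(col) / len(col) for col in zip(*vals)]
    print(title, *(f"{a:.1f}" for a in avgs), sep="\t")
EOF
cat group_averages.tsv
```

Each distinct job title ends up as exactly one output row, with the three averages computed over all of that title's input rows; that is the shape of the data we will later export to SQL.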
Use the "!quit" command to exit the interactive shell.
At this stage, we have successfully extracted the essence of the dataset. Next, we need to make it ready for presentation.
For presentation, we're going to copy the output data to the SQL database we created earlier.
To create a table and interact with the SQL database from the Hadoop node, we need to install FreeTDS. Use the following commands to install it and verify connectivity:
sudo apt-get --assume-yes install freetds-dev freetds-bin
TDSVER=8.0 tsql -H <serverName>.database.windows.net -U <adminLogin> -P <adminPassword> -p 1433 -D <databaseName>
Once you execute the last command, you'll enter an interactive shell where you can work with the database you created when setting up the SQL node.
Use the following statements to create a table to hold the output:
CREATE TABLE [dbo].[salaries](
    [job_title] [nvarchar](50) NOT NULL,
    [base_pay] float,
    [total_pay] float,
    [total_pay_n_benefits] float,
    CONSTRAINT [PK_salaries] PRIMARY KEY CLUSTERED ([job_title] ASC))
GO
Use the "exit" command to leave the SQL interactive session.
To move the data from HDFS to the SQL database, we are going to use Sqoop. The following Sqoop command will export the output data in HDFS to the SQL database:
sqoop export --connect 'jdbc:sqlserver://<serverName>.database.windows.net:1433;database=<databaseName>' --username <adminLogin> --password <adminPassword> --table 'salaries' --export-dir 'wasbs:///sf/salary/output' --fields-terminated-by '\t' -m 1
Once the task has completed successfully, you can log in to the SQL interactive session again and execute the following to view the table contents:
SELECT * FROM salaries
GO
Finally, use FTP to connect to the App Service node and upload the following PHP files (including the js folder).
You need to change the SQL server host and credentials in the "dbconnect.php" file. I'll leave the rest of the PHP code as self-explanatory. If you have built the app successfully, you should see something similar to the following:
In this article I have shown you how I did my first experiment with Azure big data analysis. Along the way I also encountered several related technologies such as Spark and Azure Stream Analytics, each with its own pros and cons. For a batch task like analyzing annual incomes, Hadoop with Hive is a generally accepted choice; for more frequent workloads, you may want to look into the alternatives.
- Project source code – https://github.com/Buddhima/AzureSFSalaryApp
- Get started with Hadoop in Windows – https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-tutorial-get-started-windows
- Analyze Flight Delays with HD Insight – https://azure.microsoft.com/en-us/documentation/articles/hdinsight-analyze-flight-delay-data-linux
- Use Hive and HiveQL with Hadoop in HDInsight – https://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-hive/