Write ETL Jobs to Offload the Data Warehouse Using Apache Spark

The surge of Big Data is everywhere. Evolving trends in BI have taken the world by storm, and many organizations are now exploring how all of it fits into their stack.

To leverage the data ecosystem to its full potential, invest in the right technology pieces – thinking ahead is what reaps the maximum IT benefits in the long run.

“By 2020, information will be used to reinvent, digitalize or eliminate 80% of business processes and products from a decade earlier.” – Gartner’s prediction puts it just right!

The following architecture diagram shows a conceptual design: it helps you harness the computing power of the Hadoop ecosystem from your conventional BI/data-warehousing tools, coupled with real-time analytics and data science (such augmented warehouses are increasingly called data lakes).

[Figure: modern data warehouse architecture]

In this post, we will discuss how to write ETL jobs that offload the data warehouse using the PySpark API of Apache Spark. With its lightning-fast data processing, Spark complements Hadoop well.

Since this blog focuses on the ETL job, let’s introduce a parent table and a sub-dimension (type 2) table from a MySQL database, which we will merge into a single dimension table in Hive with dynamic partitions.

Stay away from snowflaking while constructing a warehouse on Hive: it cuts down on needless joins, since each join task generates a map task.

Just to pique your curiosity: on a standalone Spark deployment, this example job processes over 1M rows/min.

The Employee table (300,024 rows) and the Salaries table (2,844,047 rows) are our two sources. The employees’ salary records are kept in a type 2 fashion on the ‘from_date’ and ‘to_date’ columns. The target is a Hive table partitioned on year (derived from ‘to_date’ in the Salaries table) and on a load date set to the current date. Building the table with such a partition scheme organizes the data better and speeds up queries for current employees, since the ‘to_date’ column holds the end date ‘9999-01-01’ for all current records.
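For instance, once the dimension table is loaded, a query for current employees (a hypothetical example; the table itself is built later in this post) can filter on the partition column so Hive prunes every other partition:

# Only the year=9999 partition is scanned, because every current
# record carries to_date = '9999-01-01'
current_emps = hive_ctx.sql(
    "SELECT emp_no, salary FROM employees.employee_dim WHERE year = 9999")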

The rationale is simple: join the two tables, add the load_date and year columns, and follow up with a dynamic-partition insert into the Hive table.

Check out how the DAG will look:

[Figure: Spark UI DAG visualization of the job]

As of version 1.4, the Spark UI renders the physical execution of a job as a Directed Acyclic Graph (the diagram above), much like an ETL workflow. For this blog, we have built Spark 1.5 with Hive support against Hadoop 2.6.0.

Go through the code below to complete the job: it is commented throughout, and the runtime parameters are kept inside the job for readability (ideally, they would be parameterized).

Code: MySQL to Hive ETL Job

__author__ = 'udaysharma'
# File Name: mysql_to_hive_etl.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import functions as F

# Define database connection parameters
# (the MySQL connector jar must be on the Spark classpath; see the spark-submit call below)
MYSQL_DRIVER_PATH = "/usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar"
MYSQL_USERNAME = '<USER_NAME>'
MYSQL_PASSWORD = '********'
MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/employees?user=" + MYSQL_USERNAME+"&password="+MYSQL_PASSWORD 

# Define Spark configuration
conf = SparkConf()
conf.setMaster("spark://Box.local:7077")
conf.setAppName("MySQL_import")
conf.set("spark.executor.memory", "1g")

# Initialize a SparkContext and SQLContext
sc = SparkContext(conf=conf)
sql_ctx = SQLContext(sc)

# Initialize hive context
hive_ctx = HiveContext(sc)

# Source 1 Type: MYSQL
# Schema Name  : EMPLOYEE
# Table Name   : EMPLOYEES
# + --------------------------------------- +
# | COLUMN NAME| DATA TYPE    | CONSTRAINTS |
# + --------------------------------------- +
# | EMP_NO     | INT          | PRIMARY KEY |
# | BIRTH_DATE | DATE         |             |
# | FIRST_NAME | VARCHAR(14)  |             |
# | LAST_NAME  | VARCHAR(16)  |             |
# | GENDER     | ENUM('M','F')|             |
# | HIRE_DATE  | DATE         |             |
# + --------------------------------------- +
df_employees = sql_ctx.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    driver="com.mysql.jdbc.Driver",
    dbtable="employees").load()

# Source 2 Type : MYSQL
# Schema Name   : EMPLOYEE
# Table Name    : SALARIES
# + -------------------------------- +
# | COLUMN NAME | TYPE | CONSTRAINTS |
# + -------------------------------- +
# | EMP_NO      | INT  | PRIMARY KEY |
# | SALARY      | INT  |             |
# | FROM_DATE   | DATE | PRIMARY KEY |
# | TO_DATE     | DATE |             |
# + -------------------------------- +
df_salaries = sql_ctx.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    driver="com.mysql.jdbc.Driver",
    dbtable="salaries").load()

# Perform an INNER JOIN of the two data frames on the EMP_NO column
# As of Spark 1.4, the join key is not duplicated in the join result
df_emp_sal_join = df_employees.join(df_salaries, "emp_no").select("emp_no", "birth_date", "first_name",
                                                             "last_name", "gender", "hire_date",
                                                             "salary", "from_date", "to_date")

# Adding a column 'year' to the data frame for partitioning the hive table
df_add_year = df_emp_sal_join.withColumn('year', F.year(df_emp_sal_join.to_date))

# Adding a load date column to the data frame
df_final = df_add_year.withColumn('Load_date', F.current_date())

df_final = df_final.repartition(10)

# Registering data frame as a temp table for SparkSQL
hive_ctx.registerDataFrameAsTable(df_final, "EMP_TEMP")

# Target Type: APACHE HIVE
# Database   : EMPLOYEES
# Table Name : EMPLOYEE_DIM
# + ------------------------------- +
# | COLUMN NAME| TYPE   | PARTITION |
# + ------------------------------- +
# | EMP_NO     | INT    |           |
# | BIRTH_DATE | DATE   |           |
# | FIRST_NAME | STRING |           |
# | LAST_NAME  | STRING |           |
# | GENDER     | STRING |           |
# | HIRE_DATE  | DATE   |           |
# | SALARY     | INT    |           |
# | FROM_DATE  | DATE   |           |
# | TO_DATE    | DATE   |           |
# | YEAR       | INT    | PRIMARY   |
# | LOAD_DATE  | DATE   | SUB       |
# + ------------------------------- +
# Storage Format: ORC


# Dynamic-partition inserts require these Hive settings
hive_ctx.setConf("hive.exec.dynamic.partition", "true")
hive_ctx.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

# Inserting data into the target table
hive_ctx.sql("INSERT OVERWRITE TABLE EMPLOYEES.EMPLOYEE_DIM PARTITION (year, Load_date) \
            SELECT EMP_NO, BIRTH_DATE, FIRST_NAME, LAST_NAME, GENDER, HIRE_DATE, \
            SALARY, FROM_DATE, TO_DATE, year, Load_date FROM EMP_TEMP")
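Note that the job assumes the target table already exists in Hive. A one-time DDL along these lines (a sketch matching the schema above; adjust the database name and table properties to your environment) would create it:

hive_ctx.sql("""
    CREATE TABLE IF NOT EXISTS employees.employee_dim (
        emp_no     INT,
        birth_date DATE,
        first_name STRING,
        last_name  STRING,
        gender     STRING,
        hire_date  DATE,
        salary     INT,
        from_date  DATE,
        to_date    DATE)
    PARTITIONED BY (year INT, load_date DATE)
    STORED AS ORC""")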

As the necessary configuration is already defined inside the code, we simply call spark-submit to run the job:

spark-submit mysql_to_hive_etl.py
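Depending on your setup, the MySQL connector jar may also need to be passed explicitly on the classpath, for example (the path is the one assumed earlier in the script):

spark-submit --jars /usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar \
    --driver-class-path /usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar \
    mysql_to_hive_etl.py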

Once the job has run, the target table contains 2,844,047 rows, just as expected, and this is how the partitions appear:

[Screenshots: the year and Load_date partition directories of EMPLOYEE_DIM in HDFS]

The best part: the entire process finishes within 2-3 minutes.

For more such interesting blogs and updates, follow us at DexLab Analytics. We are a premium Big Data Hadoop institute in Gurgaon catering to the needs of aspiring candidates. Opt for our comprehensive Hadoop certification in Delhi and crack such codes in a jiffy!

 

Interested in a career as a Data Analyst?

To learn more about the Data Analyst with Advanced Excel course – Enrol Now.
To learn more about the Data Analyst with R course – Enrol Now.
To learn more about the Big Data course – Enrol Now.
To learn more about Machine Learning Using Python and Spark – Enrol Now.
To learn more about the Data Analyst with SAS course – Enrol Now.
To learn more about the Data Analyst with Apache Spark course – Enrol Now.
To learn more about the Data Analyst with Market Risk Analytics and Modelling course – Enrol Now.

Quantum Internet Is Now Turning Into a Reality

Scientists across the globe are formulating new methods to realize the ‘quantum internet’: an unhackable internet connecting particles linked together by the principle of quantum entanglement. In simple terms, the quantum internet will involve multiple particles sending information to each other in the form of quantum signals – but specialists are yet to figure out what it will actually do beyond that. The term ‘quantum internet’ is quite sketchy at this moment; there is no real definition of it as of now.

Continue reading “Quantum Internet Is Now Turning Into a Reality”

Automation Doesn’t Necessarily Make Humans Obsolete, Here’s Why

Machines are going to eat our jobs.

 

AI is handling insurance claims and basic bookkeeping, maintaining investment portfolios, doing preliminary HR tasks, performing extensive legal research, and a lot more. So, do humans stand a chance against the automation apocalypse, where almost everything will be controlled by robots?

 

What do you think? You might be worried about your future job opportunities and universal basic income, but I would ask you to draw a clearer picture of this competing theory – because, in the end, this might not even be a valid question. Why? I will tell you now.

Continue reading “Automation Doesn’t Necessarily Make Humans Obsolete, Here’s Why”

Keep Pace with Automation: Emerging Data Science Jobs in India

The Indian IT market is not yet doomed. In fact, looking at the larger picture, India is expected to face a shortage of 200,000 data scientists by 2020. While traditional IT jobs are going through a rough patch, new-age jobs are surfacing, according to market reports. Big Data, Artificial Intelligence, the Internet of Things, Cloud Computing and Cybersecurity are the new digital domains replacing old-school jobs like data entry and server maintenance, which are expected to shrink further over the next five years.
The next decade is going to see the most vacancies in job posts such as web services consultant and Big Data architect.

However, a wide array of openings for web services consultants doesn’t make that the most lucrative position. Big Data architect openings are far fewer in number but offer handsome pay, according to reports.

The median salary of a web services consultant is Rs 9.27 lakh ($14,461) annually.

The median salary of a big data architect is Rs 20.67 lakh ($32,234) annually.

Now, tell me, which is better?

As technologies evolve so drastically, it becomes an absolute imperative for techies to update their skills through short learning programs and crash courses. Data analyst courses help them stay in sync with the latest technological developments, which arrive every day in some form or another. It is a constant process: to succeed in this rat race of technological superiority, professionals have to keep learning year after year. Employees need to make the time for it, and so do companies, which must adopt these newer technologies in their own systems to stay ahead of their trailing rivals.

Re-skill or perish – that is the new slogan going around. The urgency to re-skill is creating a spur among employees with mid-level experience. Surveys show that around 57% of the 7,000 IT professionals looking to enrol in a short-term learning course have at least 4 to 10 years of work experience, while a mere 11% of those with under 4 years of experience are looking for such online courses. This is because early-stage employees are mostly fresh graduates who receive in-house training from their respective companies; unlike their experienced counterparts, they don’t feel the urge to scrounge through myriad learning resources.

 

 

Today, big companies across sectors are focusing their attention on data science and analytics, triggering major reinventions in the job profile of a data analyst. “The role of a data analyst is itself undergoing a sea change, primarily because better technology is available now to aid in decision-making,” said Sumit Mitra, head of group human resources and corporate services at GILAC. To draw this to a close: data science is the new kid on the block, and IT professionals are imbibing related skills to shine in this domain. Contact DexLab Analytics for a data analyst course in Delhi; they offer in-demand data analyst certification courses at the most affordable prices.

 

Televisory Launches Data Analytics & Operational Benchmarking Platform


Televisory, a start-up based out of India and Singapore, has launched its data analytics and operational benchmarking platform. The platform can measure real-time operational and financial performance of companies. While the firm has chosen to launch its platform from the US, its services are available globally.

Continue reading “Televisory Launches Data Analytics & Operational Benchmarking Platform”

Business Intelligence: Now Every Person Can Use Data to Make Better Decisions

The fascinating world of Business Intelligence is expanding. The role of data scientists is evolving. The mysticism associated with data analytics is wearing off, making way for people from non-technical backgrounds to understand and dig deeper into the nuances and metrics of data science.
 

“Data democratization is about creating an environment where every person who can use data to make better decisions, has access to the data they need when they need it,” says Amir Orad, CEO of BI software company Sisense. Data should not be confined to the hands of data scientists; employees throughout the organization should have easy access to it as and when required.

Continue reading “Business Intelligence: Now Every Person Can Use Data to Make Better Decisions”

Google Is All Set to Wipe Off Artificial Stupidity


Well, the human-AI relationship needs to improve. Amazon’s Alexa personal assistant operates in one of the world’s largest online stores and deserves accolades for how it pulls information from Wikipedia. But what if it can’t play that rad pop banger you just heard, and responds with “I’m sorry, I don’t understand the question”? Disappointing, right?

All the revered digital helpmates, including Google’s Google Assistant and Apple’s Siri, are capable of frustrating failures that can feel like artificial stupidity. Against this backdrop, Google has decided to start a new research push to understand and improve the existing relations between humans and AI. PAIR, short for the People + AI Research initiative, was announced this Monday, and it will be shepherded by two data-viz crackerjacks, Fernanda Viégas and Martin Wattenberg.


Get a Machine Learning certification today. DexLab Analytics is here to provide comprehensive Machine Learning courses.

Nobody likes being let down by a virtual assistant that fails to perform a given task. In this context, Viégas says she is keen to study how people form expectations of what systems can and cannot do – which is to say, how virtual assistants should be designed to nudge us toward asking only the things they can actually perform, leaving no room for disappointment.

Making Artificial Intelligence transparent to everyday people, and not just professionals, is going to be a major initiative of PAIR. It has also released two open-source tools to help data scientists grasp the data they are feeding into Machine Learning systems. Interesting, isn’t it?

The deep-learning programs that have recently earned much appreciation for analyzing our personal data or diagnosing life-threatening diseases have lately been dubbed ‘black boxes’ by critical researchers, meaning it can be tricky to observe why a system churns out a specific decision, such as a diagnosis. Herein lies the problem: in life-and-death situations inside clinics, or on the road in autonomous vehicles, such opaque algorithms can pose serious risks. As Viégas puts it, “The doctor needs to have some sense of what’s happening and why they got a recommendation or prediction.”


Google’s project comes at a time when the human consequences of AI are being questioned the most. Recently, the Ethics and Governance of Artificial Intelligence Fund, in association with the Knight Foundation and LinkedIn cofounder Reid Hoffman, announced $7.6 million in grants to civil-society organizations to review the changes AI will cause in labor markets and criminal justice structures. In the same spirit, Google says most of PAIR’s work will take place in the open. MIT and Harvard professors Hal Abelson and Brendan Meade will join forces with PAIR to study how AI can improve education and science.


Closing thoughts: if PAIR can integrate AI seamlessly into prime industries like healthcare, it would pave the way for new customers to reach Google’s AI-centric cloud business. Viégas reveals she would also like to work closely with Google’s product teams, like the one responsible for developing Google Assistant. According to her, such collaborations come with an added advantage: they keep people hooked on the product, feeding the company’s broader services. PAIR is a necessary shot, not only to help society understand what is going on between humans and AI, but also to boost Google’s bottom line.

DexLab Analytics is your gateway to a great career in data analytics. Enrol in a Machine Learning course online and ride on.

 


Speaking with Tanmoy Ganguli, the expert Data Analyst Bringing Cutting Edge Technology to DexLab Analytics


DexLab Analytics is proud to announce that Tanmoy Ganguli, a proficient Data Analyst with long-standing experience in Credit Risk Modelling, SAS and regression models, is joining our Gurgaon institute as Program Director. Here are some excerpts from an interview we conducted, in which he talks about the various challenges he has faced in his career and the rapid development of Data Analytics.

Continue reading “Speaking with Tanmoy Ganguli, the expert Data Analyst Bringing Cutting Edge Technology to DexLab Analytics”

Skills Required in Interviews for a Data Scientist @ Facebook, Intel, eBay, Square, etc.

Basic Programming Languages: You should know a statistical programming language like R or Python (along with the NumPy and Pandas libraries), and a database querying language like SQL.
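As a quick sketch of the latter two side by side (toy data; Pandas assumed installed), the same aggregation in SQL and in Pandas:

import pandas as pd

# SQL equivalent: SELECT gender, AVG(salary) FROM employees GROUP BY gender
employees = pd.DataFrame({"gender": ["M", "F", "M", "F"],
                          "salary": [60000, 65000, 70000, 72000]})
print(employees.groupby("gender")["salary"].mean())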

Statistics: You should be able to explain terms like null hypothesis, p-value, maximum likelihood estimator and confidence interval. Statistics is important for crunching data and picking the most important figures out of a huge dataset; it is critical to the decision-making process and to designing experiments.
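For example, here is a minimal two-sample t-test sketch in Python (toy data; NumPy and SciPy assumed available):

import numpy as np
from scipy import stats

np.random.seed(0)
control = np.random.normal(100, 10, size=50)    # baseline group
treatment = np.random.normal(105, 10, size=50)  # experiment group

# Null hypothesis: both groups have the same mean
t_stat, p_value = stats.ttest_ind(control, treatment)
print(p_value)  # reject the null at the 5% level if p_value < 0.05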

Machine Learning: You should be able to explain k-nearest neighbors, random forests and ensemble methods. These techniques are typically implemented in R or Python, and knowing them shows employers that you have exposure to how data science is used in practice.
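A minimal scikit-learn sketch of one such technique (a random forest trained on the bundled iris toy dataset):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of decision trees; averaging their votes curbs overfitting
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out split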

Data Wrangling: You should be able to clean up data. This basically means understanding that “California” and “CA” are the same thing, and that a negative number cannot exist in a dataset describing population. It is all about identifying corrupt (or impure) data and correcting or deleting it.
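A tiny Pandas sketch of both fixes (hypothetical toy data):

import pandas as pd

df = pd.DataFrame({"state": ["California", "CA", "Texas"],
                   "population": [39500000, 39500000, -5]})

# 'CA' and 'California' are the same thing: map to one canonical form
df["state"] = df["state"].replace({"CA": "California"})

# A population can never be negative: drop such corrupt rows
df = df[df["population"] >= 0]
print(df)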

Data Visualization: A data scientist’s findings are of little use on their own. They need to be communicated to Product Managers to make sure the insights turn into real applications. Thus, familiarity with data visualization tools like ggplot is very important (so you can SHOW data, not just talk about it).
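ggplot is an R tool; as a Python stand-in, a minimal matplotlib sketch (toy figures) makes the same point:

import matplotlib.pyplot as plt

years = [2013, 2014, 2015, 2016]
revenue = [1.2, 1.8, 2.6, 3.9]  # toy figures

plt.plot(years, revenue, marker="o")
plt.xlabel("Year")
plt.ylabel("Revenue ($M)")
plt.title("Show the data, don't just talk about it")
plt.show()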

Software Engineering: You should know algorithms and data structures, as they are often necessary for building efficient machine-learning implementations. Know the use cases and running times of these data structures: queues, arrays, lists, stacks, trees, etc.
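For instance, knowing running times tells you when a plain Python list is the wrong queue (a small sketch):

from collections import deque

q = deque()
q.append("job-1")    # O(1) enqueue
q.append("job-2")
first = q.popleft()  # O(1) dequeue

slow = ["job-1", "job-2"]
slow.pop(0)          # O(n): every remaining element shifts left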


What do they look for @ Mu-Sigma and Fractal Analytics?

    • Most analytics and data science companies, including third-party analytics firms such as Mu-Sigma and Fractal, hire freshers in big numbers (sometimes hundreds every year).
    • One of the main reasons they are able to survive in this industry is the “cost arbitrage” benefit between the US and other developed countries versus India.
    • Generally speaking, they pay significantly less for talent in India compared to the same talent in the USA; hiring fresh talent from the campuses is one of their key strategies for maintaining a low cost structure.
    • If they are visiting your campus for the interview process, you should apply. If they are not, drop your resume to them using the corporate email id you can find on their websites.
    • Better still, find someone in your network (such as seniors) who works for these companies and ask them to refer you. This is normally the most effective approach after campus placements.

The key skills they look for are:

  • Love for numbers and quantitative stuff
  • Grit to keep on learning
  • Some programming experience (preferred)
  • Structured thinking approach
  • Passion for solving problems
  • Willingness to learn statistical concepts

Technical Skills

  • Math (e.g. linear algebra, calculus and probability)
  • Statistics (e.g. hypothesis testing and summary statistics)
  • Machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.)
  • Software engineering skills (e.g. distributed computing, algorithms and data structures)
  • Data mining
  • Data cleaning and munging
  • Data visualization (e.g. ggplot and d3.js) and reporting techniques
  • Unstructured data techniques
  • Python / R and/or SAS languages
  • SQL databases and database querying languages
  • Python (most common), C/C++, Java, Perl
  • Big data platforms like Hadoop, Hive & Pig

Business Skills

  • Analytic Problem-Solving: Approaching high-level challenges with a clear eye on what is important; employing the right approach/methods to make the maximum use of time and human resources.
  • Effective Communication: Detailing your techniques and discoveries to technical and non-technical audiences in a language they can understand.
  • Intellectual Curiosity: Exploring new territories and finding creative and unusual ways to solve problems.
  • Industry Knowledge: Understanding the way your chosen industry functions and how data are collected, analyzed and utilized.

 


Call us to know more