
How to Take the Plunge from IT to Analytics: Explained

With data analytics flourishing the way it is, a lot of you hailing from an IT background are seriously thinking about making the big switch from IT to analytics. The skills you possess are transferable, and the structure of data fascinates you. You know very well that you will make more money in analytics and that your career path will bring you great rewards. Yet something is stopping you from going for it!

 


 

What is it? Why are you feeling apprehensive about making the bold move that could change your life and career forever?

Continue reading “How to Take the Plunge from IT to Analytics: Explained”

Program Director Tanmoy Ganguli’s Research Papers Selected for Convergence 2017 at IFIM B School

We are happy to announce that two research papers by our honourable Program Director, Tanmoy Ganguli, titled The Global Financial Crisis, Sensex and Stock Market Bubbles: A Tale Re-told and The Modelling of Allowances for Loans and Lease Losses During Stock Market Bubbles, have been selected for presentation at the 12th Annual Conference of IFIM Business School, Bangalore. No wonder the entire team at DexLab Analytics is feeling ecstatic, and why not? We have all the reasons in the world!

 
 

Both research papers were written by Tanmoy Ganguli, who has deep expertise in Credit Risk Modelling, SAS and Regression Models. Before setting foot in the world of analytics, he worked as an assistant professor at a reputable college in Kolkata, where he gained a great deal of experience in teaching and mentoring students. His papers are thoroughly researched and were selected for presentation at Convergence 2017, which took place on 15th and 16th September 2017 at IFIM Business School, Bangalore. The theme of the event was Management 2022: Growth and Sustainability Challenges.

Continue reading “Program Director Tanmoy Ganguli’s Research Papers Selected for Convergence 2017 at IFIM B School”

How Many Category 5 Hurricanes Have We Had in the Atlantic?

With Hurricane Irma battering the state of Florida after ripping through the Caribbean like a mammoth buzzsaw blade, we started wondering how often Category 5 hurricanes occur. We know hurricanes of such magnitude are rare, but how rare, exactly?

According to Wikipedia, hurricanes with sustained wind speeds of 157 mph or more are classified as Category 5 hurricanes, powerful enough to wreak havoc wherever they go. Using SAS Analytics, let's start digging into some data to find out how many hurricanes of that magnitude have hit the Atlantic coastal towns and cities.

After a bit of research, we came across the weather.unisys.com website, which contains exhaustive data about past Atlantic hurricanes. It turned out to be a good repository: we wrote a bit of code and parsed the data into a SAS data set. Next, we marked all the Atlantic hurricane paths on a map and highlighted in bright red the line segments where the wind speed fell within the Category 5 range. So, here is how often they have occurred, along with their precise positions.
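For anyone who wants to reproduce the filtering step outside SAS, here is a rough Python sketch of the same logic. The file name and the column names (storm_name, lat, lon, wind_mph) are assumptions made purely for illustration, since the unisys track files first need to be parsed into tabular form.

import csv

CAT5_THRESHOLD_MPH = 157  # Saffir-Simpson Category 5 lower bound

def category5_segments(path):
    """Yield track points whose sustained wind speed reaches Category 5."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if float(row["wind_mph"]) >= CAT5_THRESHOLD_MPH:
                yield row["storm_name"], row["lat"], row["lon"], row["wind_mph"]

# Example: the set of distinct storms that ever reached Category 5 strength
cat5_storms = {name for name, _, _, _ in category5_segments("atlantic_tracks.csv")}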

[Image: map of Atlantic hurricane tracks with Category 5 segments highlighted in red]

Click the image above to view the full-size interactive version of the map, with HTML mouse-over text displaying the hurricane names along the red Category 5 segments.

Take a look at the technical details of the code we used to draw the map:

  1. The map is created using SAS/Graph Proc GMap.
  2. We projected the map with Proc GProject. After that, we saved the projection parameters using the new parmout= option, which lets us project the hurricane paths separately later using GProject's parmin= option. (Before the parmout/parmin options were introduced, we had to combine the map and hurricane paths, project them together, and then split the results back into two separate data sets; the new functionality simplifies things considerably.)
  3. The hurricane paths were plotted using the regular 'move' and 'draw' Annotate functions.
  4. We first plotted the land areas (a choropleth map), then annotated the hurricane tracks on top (making sure the red lines sit on top for better visibility), and finally overlaid the country border contours again so that they remain prominent.
  5. Since mouse-over text cannot be attached to lines, we annotated circles carrying the mouse-over text along the red hurricane paths. These circles are drawn at the very beginning (using when='b'), so they stay hidden behind the map.

Have a look at the table presented below. It comprises 34 Category 5 Atlantic hurricanes, derived from 150 years of data. You can also glance through a snapshot image; click on it to see the entire interactive table. And if you want to know more, click each hurricane name to look it up on Google.

[Image: snapshot of the table of 34 Category 5 Atlantic hurricanes]

Nota Bene: There might be a few Category 5 hurricanes we missed. Kindly excuse us, but we believe the main point of the discussion stands. If you have anything to tell us, scroll down and comment!

DexLab Analytics is reckoned to be the best SAS analytics training institute in Pune. The courses combine intensive subject-matter research with the relentless dedication of industry experts towards their students. For state-of-the-art SAS training courses in Pune, drop by DexLab Analytics.

Interested in a career as a Data Analyst?

To learn more about Machine Learning Using Python and Spark – click here.
To learn more about Data Analyst with Advanced Excel course – click here.
To learn more about Data Analyst with SAS Course – click here.
To learn more about Data Analyst with R Course – click here.
To learn more about Big Data Course – click here.

Write ETL Jobs to Offload the Data Warehouse Using Apache Spark


The surge of Big Data is everywhere. Evolving trends in BI have taken the world by storm, and a lot of organizations are now taking the initiative to explore how all this fits in.

Leverage the data ecosystem to its full potential and invest in the right technology pieces: it's important to think ahead so as to reap maximum IT benefits in the long run.

“By 2020, information will be used to reinvent, digitalize or eliminate 80% of business processes and products from a decade earlier.” – Gartner’s prediction put it so right!

The following architecture diagram shows a conceptual design: it helps you leverage the computing power of the Hadoop ecosystem from your conventional BI/data warehousing setup, coupled with real-time analytics and data science (data warehouses in this setup are often called data lakes).

[Diagram: modern data warehouse architecture combining the Hadoop ecosystem with conventional BI and real-time analytics]

In this post, we will discuss how to write ETL jobs to offload the data warehouse using the PySpark API from Apache Spark. Spark, with its lightning-fast data processing speed, complements Hadoop nicely.

Now, since we are focusing on the ETL job in this blog, let's introduce a parent table and a sub-dimension (type 2) table from a MySQL database, which we will merge into a single dimension table in Hive with dynamic partitions.

While constructing a warehouse on Hive, stay away from snowflaking: it saves unnecessary joins, since each join spawns a map task.

Just to pique your curiosity: the throughput of this example job on the Spark deployment alone is 1M+ rows/min.

The two sources are an Employees table (300,024 rows) and a Salaries table (2,844,047 rows); the employees' salary records are kept in type 2 fashion via the 'from_date' and 'to_date' columns. The target is a partitioned Hive table, partitioned on the year of 'to_date' from the Salaries table and on a load date set to the current date. Partitioning the table this way organizes the data better and speeds up queries about current employees, since the 'to_date' column carries the end date '9999-01-01' for all current records.
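To see what this partitioning buys us, here is a minimal sketch of the kind of query it speeds up, assuming the HiveContext (hive_ctx) and the target table defined later in this post. All current employees fall in the year = 9999 partition, so Hive can prune every other partition:

# Current employees sit in the year = 9999 partition, so only that partition is scanned
current_employees = hive_ctx.sql("""
    SELECT emp_no, first_name, last_name, salary
    FROM employees.employee_dim
    WHERE year = 9999
""")
current_employees.show(10)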

The rationale is simple: join the two tables, add the load_date and year columns, and then do a dynamic-partition insert into the Hive table.

Check out how the DAG will look:

[Screenshot: Spark UI DAG visualization of the job]

Since version 1.4, the Spark UI visualizes the physical execution of a job as a Directed Acyclic Graph (the diagram above), much like an ETL workflow. For this blog, we have built Spark 1.5 with Hive, on Hadoop 2.6.0.

Go through the code below to complete your job easily: it is largely self-explanatory, and the runtime parameters are provided within the job, although ideally they should be parameterized.

Code: MySQL to Hive ETL Job

__author__ = 'udaysharma'
# File Name: mysql_to_hive_etl.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import functions as F

# Define database connection parameters
MYSQL_DRIVER_PATH = "/usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar"
MYSQL_USERNAME = '<USER_NAME >'
MYSQL_PASSWORD = '********'
MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/employees?user=" + MYSQL_USERNAME+"&password="+MYSQL_PASSWORD 

# Define Spark configuration
conf = SparkConf()
conf.setMaster("spark://Box.local:7077")
conf.setAppName("MySQL_import")
conf.set("spark.executor.memory", "1g")

# Initialize a SparkContext and SQLContext
sc = SparkContext(conf=conf)
sql_ctx = SQLContext(sc)

# Initialize hive context
hive_ctx = HiveContext(sc)

# Source 1 Type: MYSQL
# Schema Name  : EMPLOYEE
# Table Name   : EMPLOYEES
# + --------------------------------------- +
# | COLUMN NAME| DATA TYPE    | CONSTRAINTS |
# + --------------------------------------- +
# | EMP_NO     | INT          | PRIMARY KEY |
# | BIRTH_DATE | DATE         |             |
# | FIRST_NAME | VARCHAR(14)  |             |
# | LAST_NAME  | VARCHAR(16)  |             |
# | GENDER     | ENUM('M'/'F')|             |
# | HIRE_DATE  | DATE         |             |
# + --------------------------------------- +
df_employees = sql_ctx.load(
    source="jdbc",
    path=MYSQL_DRIVER_PATH,
    driver='com.mysql.jdbc.Driver',
    url=MYSQL_CONNECTION_URL,
    dbtable="employees")

# Source 2 Type : MYSQL
# Schema Name   : EMPLOYEE
# Table Name    : SALARIES
# + -------------------------------- +
# | COLUMN NAME | TYPE | CONSTRAINTS |
# + -------------------------------- +
# | EMP_NO      | INT  | PRIMARY KEY |
# | SALARY      | INT  |             |
# | FROM_DATE   | DATE | PRIMARY KEY |
# | TO_DATE     | DATE |             |
# + -------------------------------- +
df_salaries = sql_ctx.load(
    source="jdbc",
    path=MYSQL_DRIVER_PATH,
    driver='com.mysql.jdbc.Driver',
    url=MYSQL_CONNECTION_URL,
    dbtable="salaries")

# Perform INNER JOIN on  the two data frames on EMP_NO column
# As of Spark 1.4 you don't have to worry about duplicate column on join result
df_emp_sal_join = df_employees.join(df_salaries, "emp_no").select("emp_no", "birth_date", "first_name",
                                                             "last_name", "gender", "hire_date",
                                                             "salary", "from_date", "to_date")

# Adding a column 'year' to the data frame for partitioning the hive table
df_add_year = df_emp_sal_join.withColumn('year', F.year(df_emp_sal_join.to_date))

# Adding a load date column to the data frame
df_final = df_add_year.withColumn('Load_date', F.current_date())

df_final = df_final.repartition(10)  # repartition returns a new DataFrame, so reassign it

# Registering data frame as a temp table for SparkSQL
hive_ctx.registerDataFrameAsTable(df_final, "EMP_TEMP")

# Target Type: APACHE HIVE
# Database   : EMPLOYEES
# Table Name : EMPLOYEE_DIM
# + ------------------------------- +
# | COlUMN NAME| TYPE   | PARTITION |
# + ------------------------------- +
# | EMP_NO     | INT    |           |
# | BIRTH_DATE | DATE   |           |
# | FIRST_NAME | STRING |           |
# | LAST_NAME  | STRING |           |
# | GENDER     | STRING |           |
# | HIRE_DATE  | DATE   |           |
# | SALARY     | INT    |           |
# | FROM_DATE  | DATE   |           |
# | TO_DATE    | DATE   |           |
# | YEAR       | INT    | PRIMARY   |
# | LOAD_DATE  | DATE   | SUB       |
# + ------------------------------- +
# Storage Format: ORC
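# Note: an assumption on our part, not part of the original job. Dynamic-partition
# inserts into Hive generally need the settings below, unless your hive-site.xml
# already enables them.
hive_ctx.setConf("hive.exec.dynamic.partition", "true")
hive_ctx.setConf("hive.exec.dynamic.partition.mode", "nonstrict")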


# Inserting data into the Target table
hive_ctx.sql("INSERT OVERWRITE TABLE EMPLOYEES.EMPLOYEE_DIM PARTITION (year, Load_date) \
            SELECT EMP_NO, BIRTH_DATE, FIRST_NAME, LAST_NAME, GENDER, HIRE_DATE, \
            SALARY, FROM_DATE, TO_DATE, year, Load_date FROM EMP_TEMP")

Since the necessary configuration is already defined within the code, we can simply run the job with:

spark-submit mysql_to_hive_etl.py
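If the MySQL connector jar is not already on the Spark classpath, you may need to ship it with the job. This is an assumption about your cluster setup; the path below is simply the one defined in the script:

spark-submit --jars /usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar mysql_to_hive_etl.py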

As soon as the job has run, our target table will contain 2,844,047 rows, just as expected, and this is how the partitions will appear:

[Screenshots: the year/Load_date partitions created in the Hive warehouse]

The best part is that the entire process completes within 2-3 minutes.

For more such interesting blogs and updates, follow us at DexLab Analytics. We are a premium Big Data Hadoop institute in Gurgaon catering to the needs of aspiring candidates. Opt for our comprehensive Hadoop certification in Delhi and crack such codes in a jiffy!

 

Interested in a career as a Data Analyst?

To learn more about Data Analyst with Advanced Excel course – Enrol Now.
To learn more about Data Analyst with R Course – Enrol Now.
To learn more about Big Data Course – Enrol Now.

To learn more about Machine Learning Using Python and Spark – Enrol Now.
To learn more about Data Analyst with SAS Course – Enrol Now.
To learn more about Data Analyst with Apache Spark Course – Enrol Now.
To learn more about Data Analyst with Market Risk Analytics and Modelling Course – Enrol Now.

Facebook and Microsoft Introduces ONNX: A New Open Ecosystem to Boost AI Innovation

It’s time to move beyond individual Artificial Intelligence frameworks. A recent joint effort by the digital giants Microsoft and Facebook has paved the way for developers to move across traditional AI frameworks. With the Open Neural Network Exchange (ONNX) format announced the other day, Facebook and Microsoft are looking to boost AI interoperability and innovation. The news was published on their own blog posts, and from there it went viral.

 
 

In Facebook’s blog post, the social media behemoth clearly described its new effort as a step “toward an open ecosystem where AI developers can easily move between state-of-the-art tools and choose the combination that is best for them.”

Continue reading “Facebook and Microsoft Introduces ONNX: A New Open Ecosystem to Boost AI Innovation”

Banks Merged With Fintech Startups to Perform Better Digitally

Axis Bank has acquired FreeCharge, a mobile wallet company, opening the door to many such deals in the future. In light of this, do you think banks and fintech startups have started working towards a common goal?

 
 

One day in early 2016, Rajiv Anand, the Executive Director of Retail Banking at Axis Bank, asked his team, hard at work: “Do present-day customers know how a bank will look in the future?”

Continue reading “Banks Merged With Fintech Startups to Perform Better Digitally”

What the Future Holds for Risk Management in Banking

The past decade has seen some impressive changes brought into the heart of risk management. And the change shows no signs of slowing down.

 
 

In order to keep pace with the changing times, you need to get to the crux of the five trends that are shaping the role of risk management in the banking sector: Continue reading “What the Future Holds for Risk Management in Banking”

Sources Of Banking Risks: Credit, Market And Operational Risks


Banking risk refers to future uncertainty that creates stochasticity in the cash flows from receivables on outstanding balances. Banking risks can be described in the von Neumann-Morgenstern (VNM) framework of money lotteries. In this framework, the set of outcomes is assumed to be continuous and monetary in nature, and a lottery is a list of probabilities associated with those continuous outcomes. When applied to banking, the cash flows (the set of outcomes) are assumed to be continuous and stochastic in nature. A theoretical model for the risk can be set up within this framework, as illustrated for credit risk below.
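In the usual formulation, a risky cash flow X with outcome distribution F is evaluated by its expected utility, E[u(X)] = ∫ u(x) dF(x), where u is the bank's utility (or loss) function, and two lotteries are compared by comparing these expectations.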



There are three broad sources from which banking risks originate: 1. Credit Risk 2. Market Risk 3. Operational Risk.

CREDIT RISK:

Credit risk arises when the borrower fails to honour the repayment commitments on their debts. Such risk arises as a result of adverse selection (poor screening) of applicants at the acquisition stage, or due to a change in the financial capabilities of the borrower over the course of repayment. A loan defaults if the borrower's assets (A) at maturity (T) fall below the contractual value of the obligations payable (B) (Vasicek, 1991). Let A_i be the asset value of the i-th borrower, which is described by the process:
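In the standard Vasicek single-factor setup, this process is typically taken to be a geometric Brownian motion, dA_i = μ_i A_i dt + σ_i A_i dx_i, so that log A_i(T) is normally distributed and the i-th loan defaults whenever A_i(T) < B_i.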

MARKET RISK:

Market risk is the risk that arises for banks from fluctuations in market variables such as asset prices, price levels and the unemployment rate. It arises from both on-balance-sheet and off-balance-sheet items, and includes risk stemming from macroeconomic factors such as sharp declines in asset prices and adverse stock market movements. Recessions and sudden adverse demand and supply shocks also affect borrowers' delinquency rates. Market risk covers a whole family of risks, including stock market risk, counterparty default risk, interest rate risk, liquidity risk and price level movements.

OPERATIONAL RISK:

Operational risk arises from operational inefficiencies in the human resources and business processes of an organisation. It includes fraud risk, bankruptcy risk, risk arising from cyber attacks, and so on. These risks are uncorrelated across industries and are very organisation-specific. However, operational risk excludes strategy risk and reputation risk.

This blog is the continuation of the first blog, which was on the topic – The Basics of the Banking Business and Lending Risks. To read the blog, click here ― www.dexlabanalytics.com/blog/the-basics-of-the-banking-business-and-lending-risks

Stay glued to our site for further details about banking structure and risk modelling. DexLab Analytics offers a unique module on credit risk analysis training in Bangalore.

 

Interested in a career as a Data Analyst?

To learn more about Data Analyst with Advanced Excel course – Enrol Now.
To learn more about Data Analyst with R Course – Enrol Now.
To learn more about Big Data Course – Enrol Now.

To learn more about Machine Learning Using Python and Spark – Enrol Now.
To learn more about Data Analyst with SAS Course – Enrol Now.
To learn more about Data Analyst with Apache Spark Course – Enrol Now.
To learn more about Data Analyst with Market Risk Analytics and Modelling Course – Enrol Now.

5 Ways to Enhance Value of Your Dashboards Using Maps

Today, an effective dashboard is mostly insight-driven. And since a good many analytics projects rest on spatial data, playing with maps is an indispensable skill to have in your visualization toolbox.

 
 

Here, we would like to share a few handy tips to improve the analytic and aesthetic value of the maps in your dashboard:

Continue reading “5 Ways to Enhance Value of Your Dashboards Using Maps”
