R Analytics Certification Archives - DexLab Analytics | Big Data Hadoop SAS R Analytics Predictive Modeling & Excel VBA

## Time Series Analysis & Modelling with Python (Part II) – Data Smoothing

Data smoothing is done to better understand the hidden patterns in the data. In non-stationary processes it is very hard to forecast the data because the variance changes over time, so data smoothing techniques are used to smooth out the irregular roughness and reveal a clearer signal.

In this segment we will discuss two of the most important data smoothing techniques:

• Moving average smoothing
• Exponential smoothing

#### Moving average smoothing

Moving average is a technique in which subsets of the original data are created and the average of each subset is taken. These averages smooth out the data and make the trend over a period of time easier to see.

Let's take an example to better understand the problem.

Suppose that we have price data observed over a period of time, and that the data is non-stationary so that the trend is hard to recognize.

| QTR (quarter) | Price |
|---|---|
| 1 | 10 |
| 2 | 11 |
| 3 | 18 |
| 4 | 14 |
| 5 | 15 |
| 6 | ? |

In the above data we don’t know the value of the 6th quarter.

Fig. (1): Price plotted by quarter.

The plot above shows no clear trend in the data, so to better understand the pattern we calculate the moving average over three quarters at a time. This gives us smoothed in-between values as well as an estimate for the missing 6th quarter.

To find the missing value of the 6th quarter we use the data of the previous three quarters, i.e.

$$\text{MAS}_6 = \frac{18 + 14 + 15}{3} \approx 15.7$$

| QTR (quarter) | Price |
|---|---|
| 1 | 10 |
| 2 | 11 |
| 3 | 18 |
| 4 | 14 |
| 5 | 15 |
| 6 | 15.7 |

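In the same way, the smoothed values for the 4th and 5th quarters are the averages of the three quarters before each of them:

$$\text{MAS}_4 = \frac{10 + 11 + 18}{3} = 13$$

$$\text{MAS}_5 = \frac{11 + 18 + 14}{3} \approx 14.33$$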

| QTR (quarter) | Price | MAS (Price) |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 11 | 11 |
| 3 | 18 | 18 |
| 4 | 14 | 13 |
| 5 | 15 | 14.33 |
| 6 | 15.7 | 15.7 |

Fig. (2): Moving average smoothed price plotted by quarter.

In the above graph we can see that after the 3rd quarter there is an upward-sloping trend in the data.
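The three-quarter calculation above can be sketched in a few lines of Python. This is only a minimal illustration, not the original tutorial code: the function name `trailing_moving_average` and the choice to return `None` where fewer than three earlier quarters exist are assumptions made for this example.

```python
def trailing_moving_average(values, window=3):
    """Average of the previous `window` observations for each period.

    Period i (0-based) gets the mean of values[i - window:i]. Periods
    without a full window of earlier data are returned as None. One extra
    entry is appended at the end: the estimate for the next, unseen period.
    """
    result = []
    for i in range(len(values) + 1):
        if i < window:
            result.append(None)  # not enough history yet
        else:
            chunk = values[i - window:i]
            result.append(sum(chunk) / window)
    return result


prices = [10, 11, 18, 14, 15]           # quarters 1 to 5 from the table
mas = trailing_moving_average(prices)   # entries for quarters 1 to 6

for quarter, value in enumerate(mas, start=1):
    print(quarter, None if value is None else round(value, 2))
```

The last entry of `mas` is the estimate for the 6th quarter, obtained from quarters 3, 4 and 5.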

#### Exponential Data Smoothing

In this method a larger weight is given to the most recent observations, and the weight decreases exponentially as the observations grow more distant. The weights are controlled by a smoothing constant (α), which lies between 0 and 1.

The value of α is chosen on the basis of how the data behaves: if the data shows little movement we choose a value of α closer to 0, and if the data has a lot more randomness we choose a value of α closer to 1.

$$\text{EMA} = F_t = F_{t-1} + \alpha\,(A_{t-1} - F_{t-1})$$

where $A_{t-1}$ is the actual value and $F_{t-1}$ is the smoothed value of the previous period.

Now let's see a practical example.

For this example we will take α = 0.5.

Taking the same data:

| QTR (quarter) | Price (At) | EMS Price (Ft) |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 11 | ? |
| 3 | 18 | ? |
| 4 | 14 | ? |
| 5 | 15 | ? |
| 6 | ? | ? |

To find the value for the 6th quarter we first need the smoothed values of all the earlier quarters. Since we do not have an initial value for F1, we use the value of A1. Now let's do the calculation:

F2=10+0.5(10 – 10) = 10

F3=10+0.5(11 – 10) = 10.5

F4=10.5+0.5(18 – 10.5) = 14.25

F5=14.25+0.5(14 – 14.25) = 14.13

F6=14.13+0.5(15 – 14.13)= 14.56

| QTR (quarter) | Price (At) | EMS Price (Ft) |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 11 | 10 |
| 3 | 18 | 10.5 |
| 4 | 14 | 14.25 |
| 5 | 15 | 14.13 |
| 6 | 14.56 | 14.56 |

In the above graph we see that there is a trend now where the data is moving in the upward direction.
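The recursion used in the worked example can be sketched as follows. This is a minimal illustration under the same assumptions as the example (F1 is set to A1 and α = 0.5); the function name `exponential_smoothing` is hypothetical and no smoothing library is used.

```python
def exponential_smoothing(actuals, alpha):
    """Return [F1, F2, ..., F(n+1)] for the actual values A1..An.

    Uses F1 = A1 and F(t) = F(t-1) + alpha * (A(t-1) - F(t-1)).
    The last entry is the estimate for the next, unobserved period.
    """
    forecasts = [actuals[0]]  # F1 = A1, since no earlier value exists
    for previous_actual in actuals:
        last = forecasts[-1]
        forecasts.append(last + alpha * (previous_actual - last))
    return forecasts


prices = [10, 11, 18, 14, 15]               # A1 to A5 from the table
smoothed = exponential_smoothing(prices, alpha=0.5)

for quarter, value in enumerate(smoothed, start=1):
    print(quarter, round(value, 2))
```

Because the sketch does not round the intermediate values, F5 is carried forward as 14.125 rather than 14.13; the 6th quarter estimate still rounds to 14.56.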

So, with that we come to the end of the discussion on data smoothing methods. Hopefully it helped you understand the topic; for more information you can also watch the video tutorial attached below this blog. The blog is designed and prepared by Niharika Rai, Analytics Consultant, DexLab Analytics. DexLab Analytics offers machine learning courses in Gurgaon. To keep on learning more, follow DexLab Analytics blog.

## Time Series Analysis Part I

A time series is a sequence of numerical data in which each item is associated with a particular instant in time. Many sets of data appear as time series: a monthly sequence of the quantity of goods shipped from a factory, a weekly series of the number of road accidents, daily rainfall amounts, hourly observations made on the yield of a chemical process, and so on. Examples of time series abound in such fields as economics, business, engineering, the natural sciences (especially geophysics and meteorology), and the social sciences.

• Univariate time series analysis- When we have a single sequence of data observed over time then it is called univariate time series analysis.
• Multivariate time series analysis – When we have several sets of data for the same sequence of time periods to observe then it is called multivariate time series analysis.

Each observation in a time series is treated as a random variable ($Y_t$), where t denotes time, and such a collection of random variables ordered in time is called a random or stochastic process.

Stationary: A time series is said to be stationary when all the moments of its probability distribution, i.e. mean, variance, covariance etc., are invariant over time. It becomes quite easy to forecast data in this kind of situation, as the hidden patterns are recognizable, which makes predictions easy.

Non-stationary: A non-stationary time series will have a time varying mean or time varying variance or both, which makes it impossible to generalize the time series over other time periods.

Non-stationary processes can be further explained with the help of random walk models. The random walk theory is commonly applied to the stock market, where it assumes that successive changes in stock prices are independent of each other over time. There are two types of random walk:

Random walk with drift: the observation to be predicted at time t is equal to the last period's value plus a constant or drift (α) and a residual term (ε). It can be written as

$$Y_t = \alpha + Y_{t-1} + \varepsilon_t$$

The equation shows that $Y_t$ drifts upwards or downwards depending on whether α is positive or negative, and that both the mean and the variance change over time.

Random walk without drift: the value to be predicted at time t is equal to the last period's value plus a random shock.

$$Y_t = Y_{t-1} + \varepsilon_t$$

To trace the effect of each shock, suppose the process starts at time 0 with a value of $Y_0$.

When t = 1:

$$Y_1 = Y_0 + \varepsilon_1$$

When t = 2:

$$Y_2 = Y_1 + \varepsilon_2 = Y_0 + \varepsilon_1 + \varepsilon_2$$

In general,

$$Y_t = Y_0 + \sum_{i=1}^{t} \varepsilon_i$$

If the shocks are uncorrelated with mean zero and constant variance $\sigma^2$, then $E(Y_t) = Y_0$ and $\mathrm{Var}(Y_t) = t\sigma^2$. In this case the variance increases indefinitely as t increases, whereas the mean value of Y is equal to its initial or starting value. Therefore the random walk model without drift is a non-stationary process.
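A small simulation can make the growing variance visible. The sketch below is only illustrative: it assumes standard normal shocks (mean 0, variance 1), a start value of 0 and a fixed seed, none of which come from the text above, and the helper names are hypothetical.

```python
import random
import statistics


def simulate_random_walk(steps, drift=0.0, start=0.0, rng=random):
    """Generate Y0..Y_steps with Y_t = drift + Y_(t-1) + e_t, e_t ~ N(0, 1)."""
    path = [start]
    for _ in range(steps):
        path.append(drift + path[-1] + rng.gauss(0.0, 1.0))
    return path


rng = random.Random(42)  # fixed seed so the run is repeatable
paths = [simulate_random_walk(100, rng=rng) for _ in range(2000)]


def variance_at(t):
    """Variance of Y_t across all simulated paths."""
    return statistics.pvariance([p[t] for p in paths])


def mean_at(t):
    """Mean of Y_t across all simulated paths."""
    return statistics.fmean([p[t] for p in paths])


for t in (10, 50, 100):
    print(t, round(mean_at(t), 2), round(variance_at(t), 2))
```

Across the simulated paths the mean stays close to the starting value while the variance grows roughly in proportion to t, which is the behaviour described for the random walk without drift. Passing a non-zero `drift` lets the same function illustrate the model with drift.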

So, with that we come to the end of the discussion on time series. Hopefully it helped you understand time series; for more information you can also watch the video tutorial attached below this blog. DexLab Analytics offers machine learning courses in Delhi. To keep on learning more, follow DexLab Analytics blog.

## R Vs Python: A Debate Forever

In this blog, we will bring forth the age-old question and check which one is better, R or Python, when it comes to data science.

To be very honest, this question does not have a strict answer to it. However, in this blog we will lay down the key components of both the languages to give you a clearer picture. In the end, please decide for yourself and leave your comments in the section below.

The aim of this blog is to objectively put forward the pros and cons of both languages strictly from the perspective of data science.

We will discuss only three main components, which are as follows:

• Syntax
• Performance
• Applicability

There are other metrics, such as trends in industry and adoption in recent years, which are beyond the scope of this blog. However, you can safely declare Python the clear winner as far as those are concerned.

So let’s get started:

#### Syntax

Both R and Python are object-oriented languages. This is to say that everything is created as an object in which the information is mapped with the idea of using that object later in the analysis. However, when it comes to the syntax, i.e., the grammar of programming, R and Python are indeed very different.

#### R Programming

R programming is better suited to seasoned coders who have prior experience of coding. The syntax is similar to that of earlier languages such as C, C++ or Java, and its fundamental rules follow those of the C programming language. Semicolons are optional in R; they are needed only to separate multiple statements written on the same line.

#### Python

Python on the other hand, is the language more adaptable to the new generation of programmers. You can come from a non-programming background and still learn Python with relative ease.

Python is one of the most user-friendly languages for beginners. The syntax is designed to prioritize readability over conciseness of the code. In layman's terms, coding in Python is very close to reading and writing plain English. For this reason, it is really popular amongst beginners in data science.

#### Performance

When it comes to programming, performance is essentially measured by speed.

#### R Programming

As far as the general consensus goes R programming is much slower in terms of speed. The reason behind this is that R programming was initially designed to be used by statisticians for data analysis. Thus, R programming stresses more on precision than the speed.

#### Python

Python on the other hand, is relatively faster than R. Python offers the same level of precision whilst acting on a faster speed.

Note – The speed is taken into account independent of packages and libraries.

#### Applicability

Lastly, we will discuss the popular domains in which these languages are used.

#### R Programming

As mentioned above, R was developed specifically for statisticians. For this reason, R is mainly used in various research organizations and academia in general. However, R is now quickly being absorbed in the enterprises as well, mainly because of its popularity and the availability of a large number of packages for statistical computation.

#### Python


As Python is a general-purpose programming language, we can use it to build different kinds of applications. We can use Python to build web applications using popular frameworks like Django or Flask.

Lately, Python has become popular amongst data scientists as the language of choice, given the simplicity of syntax, speed and performance it has to offer. The last few years have seen a sharp rise in the adoption of Python over R in data science.

So, there you have it folks. Decide for yourself now! We will meet you soon in the next blog.

Dexlab Analytics is a pioneering institute of Data Science and Big Data Analytics with all-inclusive Big data courses in Delhi along with numerous other efficacious courses like Hadoop certification in Delhi, R programming courses in Gurgaon and Python for Data Analysis under experienced trainers and professionals.

## Statistical Application in R & Python: Normal Probability Distribution

Gauss, the famous German mathematician, is responsible for developing one of the most significant distributions in all of statistics, i.e. the Normal Distribution. Please refer to the blog on Central Limit Theorem: www.dexlabanalytics.com/blog/the-almighty-central-limit-theorem. It will help you fully grasp the significance of the Normal Distribution. However, if you want to revisit our series of blogs by following it from the start, you can reach STATISTICAL APPLICATION IN R & PYTHON: CHAPTER 1 – MEASURE OF CENTRAL TENDENCY right now!

Essentially, the Normal Distribution provides “approximations” to most other distributions such as the Binomial, Poisson, Gamma, Exponential, etc. This is to say as sample sizes get statistically large enough, most distributions approximate into a normal shaped curve.

Every distribution has important features known as its “parameters”. The normal distribution has two parameters: the mean (μ) and the variance (σ²). The normal distribution has a bell-shaped curve, where the probability density peaks at its mean in the middle.

The Normal Distribution has vast practical applications in the field of Business, Finance, Medicine, and Physics and so on. Things like weights, heights, IQ scores follow the Normal Distribution.

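The Normal Distribution, also called the Gaussian distribution, is a continuous probability distribution and is defined by the Probability Density Function (PDF):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Where, μ is the mean, σ is the standard deviation (σ² is the variance) and x is the value of the variable.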

#### Application:

Assume that the credit score fits a Normal Distribution.

Suppose Mr. Arjun’s credit scores for the last 10 months are:

789, 635, 739, 687, 724, 810, 817, 735, 819, 820

What is the probability that the credit score will be 825 or more in the 11th month?

| Month | Credit Score |
|---|---|
| January | 789 |
| February | 635 |
| March | 739 |
| April | 687 |
| May | 724 |
| June | 810 |
| July | 817 |
| August | 735 |
| September | 819 |
| October | 820 |

#### Calculating Normal Distribution in R:

Calculating the normal probability in R, we find that the probability that the 11th month's credit score will be 825 or more is 14.60%, while the probability that it will be 825 or less is 85.40%.

#### Calculate Normal Distribution in Python:

Make a data frame of the data and calculate the mean and standard deviation needed for the Normal Distribution.

Now, we can easily calculate the Normal Distribution in Python.

So, calculating the normal probability in Python, we find that the probability that the 11th month's credit score will be 825 or more is 14.60%, while the probability that it will be 825 or less is 85.40%.
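The original code listing is not reproduced on this page, so the following is only a minimal sketch of the calculation using Python's standard `statistics` module instead of a data frame. It assumes the sample standard deviation (divisor n − 1), which appears to be what yields a figure close to the 14.60% quoted above.

```python
from statistics import NormalDist, mean, stdev

# Mr. Arjun's credit scores for January to October
scores = [789, 635, 739, 687, 724, 810, 817, 735, 819, 820]

mu = mean(scores)       # sample mean
sigma = stdev(scores)   # sample standard deviation (n - 1 in the divisor)

dist = NormalDist(mu, sigma)

p_at_most_825 = dist.cdf(825)        # P(score <= 825)
p_at_least_825 = 1 - p_at_most_825   # P(score >= 825)

print(f"mean = {mu}, sd = {sigma:.2f}")
print(f"P(score >= 825) = {p_at_least_825:.2%}")
print(f"P(score <= 825) = {p_at_most_825:.2%}")
```

`NormalDist.cdf` returns the area under the bell curve to the left of the target, so the probability of reaching 825 or more is its complement.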

#### Conclusion:

The Normal Distribution is used for calculating probabilities from its parameters. It is represented by the bell curve, where the total area under the curve is 1. The Normal Distribution has its uses in finance, business, salaries, blood pressures, measurement and many other fields.

Here, we have used Normal Distribution to predict Mr. Arjun’s 11th month credit score, and set the target (825). By Normal Distribution we can predict the percentage of possibility to achieve the target.

Calculating Binomial Distribution might be tricky for many but with Dexlab Analytics it won’t be hassle anymore. So, get hold of our STATISTICAL APPLICATION IN R AND PYTHON: CALCULATING BINOMIAL DISTRIBUTION blog, to get around all your problems.

## Statistical Application In R & Python: Chapter 1 – Measure Of Central Tendency

Statistical analysis helps explore data relationship and develop high-end models to frame better decisions. It’s an intricate process of collecting and evaluating data to define the nature of data that has to be analyzed.

Below, we dig into the basics of statistical application in R and Python using the measure of central tendency.

• #### Introduction:-

Statistics is a body of methods for the study of numerical data. When the rows or columns of a data set are too long to read through, it becomes necessary to summarize the data in an easily manageable form. This purpose is served by classifying the data in the form of frequency distributions and various graphs. When the data relate to a single variable, the process of summarization can be taken a step further by using certain descriptive measures. The aim is to focus on certain features of the data, such as central tendency and dispersion.

• #### Central Tendency :

Notwithstanding their variability, the observations in a set of data have a tendency to cluster around a central value, and this tendency of quantitative statistical observations is called central tendency.

The three commonly used measures of central tendency are:

• Mean
• Median
• Mode

The description of these 3 estimators starts below:

• #### Mean:-

The mean is the most commonly used measure of central tendency.

The concept of mean is divided into three parts:-

• Arithmetic mean.
• Geometric mean.
• Harmonic mean.

Mainly the mean refers to an arithmetic mean.
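To see how the three kinds of mean differ, here is a short sketch using Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). The values 2, 4 and 8 are made up purely for illustration and do not come from the blog's data.

```python
from statistics import mean, geometric_mean, harmonic_mean

values = [2, 4, 8]  # illustrative numbers only

am = mean(values)             # arithmetic mean: sum of values / count
gm = geometric_mean(values)   # geometric mean: n-th root of the product
hm = harmonic_mean(values)    # harmonic mean: count / sum of reciprocals

print(round(am, 4), round(gm, 4), round(hm, 4))
```

For positive values that are not all equal, the arithmetic mean is the largest of the three and the harmonic mean the smallest.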

• #### Arithmetic Mean (A.M.):-

The arithmetic mean of a set of observations is defined to be their sum, divided by the number of observations.

For n observations ($x_1, x_2, \dots, x_n$):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

• #### Weighted A.M.

For a frequency distribution where the values $x_i$ have frequencies $f_i$ ($i = 1, 2, 3, \dots$):

$$\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i}$$

• #### Application of A.M.:-

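Let’s calculate the mean of Age, Height & Weight from the given data.

| Name | Sex | Age | Height | Weight |
|---|---|---|---|---|
| Ritesh | M | 24 | 6.9 | 112.5 |
| Heena | F | 23 | 5.65 | 84 |
| Kritika | F | 23 | 6.53 | 98 |
| Anuradha | F | 24 | 6.28 | 102.5 |
| Gaurav | M | 24 | 6.35 | 102.5 |
| Prakash | M | 22 | 5.73 | 83 |
| Aarti | F | 22 | 5.98 | 84.5 |
| Meena | F | 25 | 6.25 | 112.5 |
| Utkarsh | M | 23 | 6.25 | 84 |
| Chirag | M | 22 | 5.9 | 99.5 |
| Neha | F | 21 | 5.13 | 50.5 |
| Smrita | F | 24 | 6.43 | 90 |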

#### Calculating Mean in Python:

Therefore,

Age (Mean) = 23.08333333, Height (Mean) = 6.12, Weight (Mean) = 91.95833333 (computed from the Weight column as listed in the table above)
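The original Python listing is not reproduced on this page. As a minimal sketch, the column means can be computed with the standard `statistics` module; the tuple layout and variable names below are assumptions made for this example, and the values are copied from the table as listed.

```python
from statistics import mean

# (name, sex, age, height, weight) rows copied from the table above
records = [
    ("Ritesh", "M", 24, 6.9, 112.5),
    ("Heena", "F", 23, 5.65, 84),
    ("Kritika", "F", 23, 6.53, 98),
    ("Anuradha", "F", 24, 6.28, 102.5),
    ("Gaurav", "M", 24, 6.35, 102.5),
    ("Prakash", "M", 22, 5.73, 83),
    ("Aarti", "F", 22, 5.98, 84.5),
    ("Meena", "F", 25, 6.25, 112.5),
    ("Utkarsh", "M", 23, 6.25, 84),
    ("Chirag", "M", 22, 5.9, 99.5),
    ("Neha", "F", 21, 5.13, 50.5),
    ("Smrita", "F", 24, 6.43, 90),
]

# Arithmetic mean of each numeric column: sum of values / number of rows
mean_age = mean(row[2] for row in records)
mean_height = mean(row[3] for row in records)
mean_weight = mean(row[4] for row in records)

print("Age (Mean)    =", mean_age)
print("Height (Mean) =", round(mean_height, 2))
print("Weight (Mean) =", mean_weight)
```

With a pandas data frame the same result would typically come from calling `.mean()` on the numeric columns.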

• #### Application of Weighted A.M.:-

The weighted mean is the mean in which each value is weighted by its frequency.

#### Calculate the average price per ton of coal purchased by the industry for the half-year.

| Month | Price Per Ton | Tons Purchased |
|---|---|---|
| January | Rs. 52.49 | 26 |
| February | Rs. 62.23 | 34 |
| March | Rs. 87.26 | 40 |
| April | Rs. 45.25 | 54 |
| May | Rs. 78.56 | 13 |
| June | Rs. 69.25 | 45 |

#### Data to solve:

| Month | Price (Rs) Per Ton (x) | Tons Purchased (f) | fx = y |
|---|---|---|---|
| January | 52.49 | 26 | 1364.74 |
| February | 62.23 | 34 | 2115.82 |
| March | 87.26 | 40 | 3490.4 |
| April | 45.25 | 54 | 2443.5 |
| May | 78.56 | 13 | 1021.28 |
| June | 69.25 | 45 | 3116.25 |
| Total | 395.04 | N = 212 | 13551.99 |

The price is denoted by x (52.49, 62.23, 87.26, 45.25, 78.56, 69.25 [in Rs.]), which sums to 395.04.

The quantity purchased (frequency) is denoted by f (26, 34, 40, 54, 13, 45), which sums to N = 212.

Multiplying each x by its f gives the amount spent, denoted by y; the total is Σfx = 13551.99.

#### Calculate Weighted Mean in Python:

Calculating the weighted mean in R and in Python gives the same result: 13551.99 / 212 = 63.9244811.
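The original listing is not reproduced on this page; the weighted mean calculation can be sketched in plain Python as follows. The variable names are assumptions for this example, and the figures are taken from the coal price table above.

```python
# Price per ton (x) and tons purchased (f) for January to June
prices = [52.49, 62.23, 87.26, 45.25, 78.56, 69.25]
tons = [26, 34, 40, 54, 13, 45]

# Sum of f * x: total amount spent over the half-year
total_cost = sum(price * qty for price, qty in zip(prices, tons))

# N: total tons purchased
total_tons = sum(tons)

# Weighted arithmetic mean = sum(f * x) / sum(f)
weighted_mean = total_cost / total_tons

print(round(total_cost, 2), total_tons, round(weighted_mean, 7))
```

Note that the simple average of the six prices (395.04 / 6) would ignore how many tons were bought at each price, which is why the weighted mean is the appropriate measure here.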

Want to know more about the nature of data? Keen to perform high-end statistical analysis using Python and R? Follow DexLab Analytics, an excellent Python training center in Gurgaon, India. Our team of consultants will help you learn the basics of R and Python in the easiest manner possible.

## Predictive Analytics: The Key to Enhance the Process of Debt Collection

A wide array of industries has already engaged in some kind of predictive analytics – numerical analysis of debt collection is relatively a recent addition. Financial analysts are now found harnessing the power of predictive analytics to cull better results out for their clients, and measure the effectiveness of their strategies and collections.

Let’s see how predictive analytics is used in debt collection process:

#### Understanding Client Scoring (Risk Assessment)

Since the late 1980s, the FICO score has been regarded as the gold standard for determining creditworthiness and assessing loan applications. However, machine learning, particularly predictive analytics, can replace it and develop an encompassing portrait of a client that takes into account more than his credit history and present debts. It can also include his social media feeds and spending trajectory.

#### Evaluating Payment Patterns

Survival models evaluate each client’s probability of becoming a potential loss. If an account shows a continuous downward trend, it should soon be regarded as a potential risk. Predictive analytics can help identify spending patterns that indicate the struggles of each client. A system can be developed which triggers itself whenever an unwanted pattern appears. It could ask the client whether they need any help or are going through financial distress, so that it can help before the situation turns beyond repair.

For R predictive modeling training courses, visit DexLab Analytics.

#### Cash Flow Predictions

Businesses are keen to know about future cash flows – what they can expect! Financial institutions are no different. Predictive analytics helps in making more appropriate predictions, especially when it comes to receivables.

Debt collector’s business models are subject to the ability to forecast the success of collection operations, and ascertaining results at the end of each month, before the billing cycle initiates. As a result, the workforce of the company is able to shift their focus from the potential payers to those who would not be able to meet their obligations. This shift in focus helps!

#### Better Client Relationship

Predictive analytics works wonders: not only can it point out which clients are the highest risks for your company, it can also predict the best time to contact them for maximum results. All you need to do is review the logs of past conversations.

#### Challenges

Last but not least, all big data models face a common challenge: data cleaning. As it is a case of garbage in, garbage out, a company should deal with this problem before starting on prediction, constructing a pipeline to feed in the data, clean it and use it for neural network training.

In a concluding statement, predictive analytics is the best bet for debt and revenue collection – it boosts conversion rates at the right time with the right people. If you want to study more about predictive analytics, and its varying uses in different segments of industry, enroll in R Predictive Modelling Certification training at DexLab Analytics. They provide superior knowledge-intensive training to interested individuals with added benefit of placement assistance. For more, visit their website.

The blog has been sourced from dataconomy.com/2018/09/improving-debt-collection-with-predictive-models

## How R Programming is Transforming Business for Good

Today, every business is putting efforts to understand their customers and themselves, better. But, how? What methods are they applying? Do mere Excel pivot tables help analyze vast pool of data? The answer to the latter question is in the negative – Excel pivot tables are not that great at analyzing data – so a wide number of companies look forward to SAS and R Programming to cull Business Intelligence.

Besides SAS, there is R, an open-source language used by most of the budding data scientists in the world of analytics. The R programming language is more oriented towards the correct implementation of data science, while giving businesses cutting-edge data analysis tools.

## How to create Chart Templates with R Functions

R functions are used to produce chart templates to keep the look and feel of the reports intact.

In this post you will learn how to create chart templates with R functions. All R users should be accustomed to calling functions to perform calculations and draw plots accurately. Instead of remembering which colors and fonts to use each time, R functions can be used as a shortcut for producing consistent-looking charts.

## Analyze Smartphone Sensor Data with R and the BreakoutDetection Package

Quite interesting. Juggling with sensor data is starkly different from working with economics data, document processing or social networking, but very worthwhile. In this blog, we will take a practical approach to analyzing smartphone sensor data with R. We are going to use the accelerometer smartphone data that Datarella presented in its Data Fiction competition. The dataset records the acceleration along the three axes of the smartphone:

x – for sideways acceleration

y – for forward and backward acceleration

z – for upward and downward acceleration

The trickier part lies in its interpretation. First, there are device-, manufacturer- and sensor-specific variations and artifacts. Second, all acceleration is measured relative to the orientation of the device's sensor. For example, taking the cell phone out of your pocket and reading a tweet can be represented in the following way:

y acceleration – the phone was in the pocket top down but now has been taken out

z and y acceleration – tossing the phone so that it becomes horizontal

x acceleration – moving the smartphone from the left to the middle of your body

z acceleration – bringing up the phone so that you can read the tweet clearly

And thirdly, the gravity influences all the movements.

Seeking R programming courses in Gurgaon? Feel free to reach us at DexLab Analytics.

Knowing exactly what to do with your smartphone can be quite intimidating – let us introduce an application of the Twitter BreakoutDetection Open Source library (see Github), which is used extensively for Behavioral Change Point analysis.

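#### First, I have loaded the dataset and this is what it looks like:

```
setwd("~/Documents/Datarella")
# (load the accelerometer CSV into `accel` here; the file name is not given)
head(accel)  # first rows of the accelerometer data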

user_id           x          y        z                 updated_at                 type
1      88 -0.06703765 0.05746084 9.615114 2014-05-09 17:56:21.552521 Probe::Accelerometer
2      88 -0.05746084 0.10534488 9.576807 2014-05-09 17:56:22.139066 Probe::Accelerometer
3      88 -0.04788403 0.03830723 9.605537 2014-05-09 17:56:22.754616 Probe::Accelerometer
4      88 -0.01915361 0.04788403 9.567230 2014-05-09 17:56:23.372244 Probe::Accelerometer
5      88 -0.06703765 0.08619126 9.615114 2014-05-09 17:56:23.977817 Probe::Accelerometer
6      88 -0.04788403 0.07661445 9.595961  2014-05-09 17:56:24.53004 Probe::Accelerometer
```

#### This data includes the sensor data per user per day:

```
accel$day <- substr(accel$updated_at, 1, 10)
df <- accel[accel$day == '2014-05-12' & accel$user_id == 88,]
df$timestamp <- as.POSIXlt(df$updated_at) # Transform to POSIX datetime
library(ggplot2)
ggplot(df) + geom_line(aes(timestamp, x, color="x")) +
geom_line(aes(timestamp, y, color="y")) +
geom_line(aes(timestamp, z, color="z")) +
scale_x_datetime() + xlab("Time") + ylab("acceleration")
```

#### Let’s focus on the period between 12:32 and 13:00:

```
ggplot(df[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 13:00:00',]) +
geom_line(aes(timestamp, x, color="x")) +
geom_line(aes(timestamp, y, color="y")) +
geom_line(aes(timestamp, z, color="z")) +
scale_x_datetime() + xlab("Time") + ylab("acceleration")
```

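#### Following all this, I load the BreakoutDetection library:

```
install.packages("devtools")
# BreakoutDetection is distributed via GitHub, so install it with devtools
devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)
# Detect multiple change points in the x acceleration between 12:32 and 12:35
bo <- breakout(df$x[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 12:35:00'],
               min.size=10, method='multi', beta=.001, degree=1, plot=TRUE)
bo$plot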
```

A quick analysis of the acceleration in the x direction gives us 4 change points, at which the acceleration suddenly starts to change. At the start, the smartphone lies flat on a horizontal surface: the z sensor reading hovers around a value of 9.8 in the positive direction, which means that gravity acts only on this axis and not on the x or y axes. After a couple of movements and changes of direction, the last segment shows the phone in a position where the x axis reads about 9.6, meaning the phone is held in a landscape orientation facing the right.

This post originally appeared on www.r-bloggers.com/how-to-analyze-smartphone-sensor-data-with-r-and-the-breakoutdetection-package