Business Analytics Online Certification Archives - DexLab Analytics | Big Data Hadoop SAS R Analytics Predictive Modeling & Excel VBA

Time Series Analysis Part I

A time series is a sequence of numerical data in which each item is associated with a particular instant in time. Many sets of data appear as time series: a monthly sequence of the quantity of goods shipped from a factory, a weekly series of the number of road accidents, daily rainfall amounts, hourly observations made on the yield of a chemical process, and so on. Examples of time series abound in such fields as economics, business, engineering, the natural sciences (especially geophysics and meteorology), and the social sciences.

• Univariate time series analysis – When we have a single sequence of data observed over time, the analysis is called univariate time series analysis.
• Multivariate time series analysis – When we have several sets of data observed over the same sequence of time periods, the analysis is called multivariate time series analysis.

Each observation in a time series is treated as a random variable Yt, where t denotes time. Such a collection of random variables ordered in time is called a random or stochastic process.

Stationary: A time series is said to be stationary when the moments of its probability distribution, i.e. mean, variance, covariance etc., are invariant over time. It becomes quite easy to forecast data in this kind of situation, as the hidden patterns persist over time and can be carried forward into predictions.

Non-stationary: A non-stationary time series will have a time varying mean or time varying variance or both, which makes it impossible to generalize the time series over other time periods.

Non-stationary processes can further be explained with the help of random walk models. The random walk theory is commonly applied to stock markets, where it assumes that successive changes in stock prices are independent of each other over time. There are two types of random walks:
Random walk with drift: The observation to be predicted at time ‘t’ is equal to the last period’s value plus a constant or drift (α) and a residual term (ε). It can be written as
Yt= α + Yt-1 + εt
The equation shows that Yt drifts upwards or downwards depending on whether α is positive or negative; the mean shifts by α each period and the variance also increases over time.
Random walk without drift: The random walk without drift model states that the value to be predicted at time ‘t’ is equal to the last period’s value plus a random shock.
Yt= Yt-1 + εt
Consider the effect of the successive shocks, supposing that the process started at time 0 with a value of Y0.
When t=1
Y1= Y0 + ε1
When t=2
Y2= Y1+ ε2= Y0 + ε1+ ε2
In general,
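Yt = Y0 + ∑ εi, with the sum taken over i = 1, …, t
In this case, if the shocks are uncorrelated with zero mean and constant variance σ², the mean value of Yt is equal to its initial or starting value Y0, whereas the variance of Yt is tσ² and increases indefinitely as t increases. Therefore the random walk model without drift is a non-stationary process.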
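The behaviour of the two random walks can be checked with a small simulation. The sketch below is only an illustration under assumed settings: normally distributed shocks with unit variance, a starting value of 0, and a drift of 0.5, none of which come from the discussion above. The function names are hypothetical.

```python
import random
import statistics

def simulate_random_walk(n_steps, drift=0.0, y0=0.0, sigma=1.0, rng=None):
    """Return [Y0, Y1, ..., Yn] for Yt = drift + Y(t-1) + eps_t, eps_t ~ N(0, sigma^2)."""
    if rng is None:
        rng = random.Random()
    path = [y0]
    for _ in range(n_steps):
        path.append(drift + path[-1] + rng.gauss(0.0, sigma))
    return path

def cross_section(n_paths, n_steps, drift, seed=0):
    """Simulate many independent walks and return the mean and variance of Yt at t = n_steps."""
    rng = random.Random(seed)
    finals = [simulate_random_walk(n_steps, drift, rng=rng)[-1] for _ in range(n_paths)]
    return statistics.fmean(finals), statistics.pvariance(finals)

# Assumed settings: 2000 simulated paths, drift of 0.5 for the "with drift" case.
for t in (10, 100):
    mean_nd, var_nd = cross_section(2000, t, drift=0.0)
    mean_d, var_d = cross_section(2000, t, drift=0.5)
    print(f"t={t:3d}  no drift: mean={mean_nd:6.2f} var={var_nd:7.2f}   "
          f"drift=0.5: mean={mean_d:6.2f} var={var_d:7.2f}")
```

Under these assumptions the variance across paths should grow roughly in proportion to t in both cases, while the mean stays near the starting value without drift and moves by about α per period with drift.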

So, with that we come to the end of the discussion on time series. Hopefully it helped you understand time series; for more information you can also watch the video tutorial attached below this blog. DexLab Analytics offers machine learning courses in Delhi. To keep on learning more, follow the DexLab Analytics blog.


A Guide To Different Types Of Business Analytics

Businesses today can no longer afford to run on assumptions; they need actionable intel that can help them formulate sharper business strategies. Big data holds the key to the information they need, and applying business analytics strategies can help businesses realize their goals. Business analytics is about collecting data and processing it to glean valuable business information. It puts statistical models to use to gain business insight. It is a crucial branch of business intelligence that applies cutting-edge tools to dissect available data and detect patterns in order to predict market trends, and business analysis training in Delhi can help a professional in this field in a big way.

Business analytics could be broken down into four different segments all of which perform different tasks yet all of these are interrelated. The types are namely Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. The role of each is to offer a thorough understanding of the data to predict future solutions. Find out how these different types of analytics work.

Descriptive Analytics: Descriptive analytics is the simplest form of analytics, and the term itself is self-explanatory. It is all about presenting a summary of the data a business organization holds, to create a clear picture of past trends and to capture the present situation. It helps an organization understand which areas need attention and where its strengths lie. By analyzing historical data, certain trends can be identified, and these can offer valuable insight for developing a plan. Usually, the volume of data, both structured and unstructured, is beyond our comprehension unless it is presented in a coherent, easily digestible format. Descriptive analytics performs that function with the help of data aggregation and data mining techniques. It also improves communication by summarizing data that needs to be accessible to employees as well as to investors.

Diagnostic Analytics: Diagnostic analytics plays the role of detecting issues a company might be facing. Once the entire data set has been presented comprehensively, it is time to diagnose the patterns detected and find the issues that might be causing harm. This type of business analytics dives deeper into the problem and offers an in-depth analysis to bring out its root cause. Diagnostic analytics concerns itself with the problem-finding aspect, reading data and extracting information to find out why something is not working, or is working in a way that causes considerable trouble. Principal component analysis, conjoint analysis, and drill-down are some of the techniques usually employed in this branch of analytics. Diagnostic analytics takes a critical look at issues and allows the management to identify the reasons so that they can work on them.

Predictive Analytics: Predictive analytics is a sophisticated form of analytics that takes the results of descriptive analytics and works on them to forecast probabilities. It does not predict an outcome with certainty; it suggests probabilities by combining statistics and machine learning. It looks at past data, mainly the history of the organization and its past performance, takes into account the current state, and on the basis of that analysis suggests future trends. However, predictive analytics does not work like magic. It does its job based on the data provided, so data quality matters here. High-quality, complete data supports accurate prediction, because the data is analyzed to find patterns and prediction takes off from there. This type of analytics plays a key role in strategizing: based on the forecasts, the company can change its sales and marketing strategy and set a new goal.

Prescriptive Analytics: With prescriptive analytics, an organization can find a direction, as it is about suggesting solutions for the future. It suggests possible trends or outcomes, and based on those it can also suggest actions that could be taken to achieve desired results. It employs simulation and optimization modeling to determine the ideal course of action to reach a certain goal. This form of analytics offers recommendations in real time and could be thought of as the next step after predictive analytics. Here not just previously stored data is put to use; real-time data is also utilized, and this type of analytics also takes into account data coming from external sources to offer better results.

Those were the four types of business analytics that data analysts employ to offer sharp business insight to an organization. However, carrying out business analytics procedures requires skilled people, such as those who have done Business analyst training courses in Gurgaon, to drive organizations towards a brighter future.


Basic of Statistical Inference Part-IV: An Overview of Hypothesis Testing

In this series we cover the basics of statistical inference. This is the fourth part of our discussion, where we explain the concept of hypothesis testing, which is a statistical technique. You could also check out the 3rd part of the series here.

Introduction

The objective of sampling is to study the features of the population on the basis of sample observations. A carefully selected sample is expected to reveal these features, and hence we shall infer about the population from a statistical analysis of the sample. This process is known as Statistical Inference.

There are two types of problems. Firstly, we may have no information at all about some characteristics of the population, especially the values of the parameters involved in the distribution, and it is required to obtain estimates of these parameters. This is the problem of Estimation. Secondly, some information or hypothetical values of the parameters may be available, and it is required to test how far the hypothesis is tenable in the light of the information provided by the sample. This is the problem of Test of Hypothesis or Test of Significance.

In many practical problems, statisticians are called upon to make decisions about a population on the basis of sample observations. For example, given a random sample, it may be required to decide whether the population, from which the sample has been obtained, is a normal distribution with mean = 40 and s.d. = 3 or not. In attempting to reach such decisions, it is necessary to make certain assumptions or guesses about the characteristics of population, particularly about the probability distribution or the values of its parameters. Such an assumption or statement about the population is called Statistical Hypothesis. The validity of a hypothesis will be tested by analyzing the sample. The procedure which enables us to decide whether a certain hypothesis is true or not, is called Test of Significance or Test of Hypothesis.

What Is Testing Of Hypothesis?

Statistical Hypothesis

A hypothesis is a statistical statement or a conjecture about the value of a parameter. The basic hypothesis being tested is called the null hypothesis, denoted by 𝐻0. It is sometimes regarded as representing the current state of knowledge and belief about the value being tested. In a test, the null hypothesis is contrasted with an alternative hypothesis, denoted by 𝐻1. When a hypothesis specifies the distribution completely, it is called a simple hypothesis; when not all parameters of the distribution are specified, it is known as a composite hypothesis.

Testing Of Hypothesis

The entire process of statistical inference is mainly inductive in nature, i.e., it is based on deciding the characteristics of the population on the basis of a sample study. Such a decision always involves an element of risk, i.e., the risk of taking wrong decisions. It is here that the modern theory of probability plays a vital role, and the statistical technique that helps us arrive at the criterion for such a decision is known as the testing of hypothesis.

Testing Of Statistical Hypothesis

A test of a statistical hypothesis is a two-action decision taken after observing a random sample from the given population, the two actions being the acceptance or rejection of the hypothesis under consideration. Therefore a test is a rule which divides the entire sample space into two subsets.

1. A region in which the data is consistent with 𝐻0.
2. Its complement, in which the data is inconsistent with 𝐻0.

The actual decision is, however, based on the values of a suitable function of the data, the test statistic. The set of all possible values of the test statistic which are consistent with 𝐻0 is the acceptance region, and the set of all values of the test statistic which are inconsistent with 𝐻0 is called the critical region. One important condition that must be kept in mind for a test statistic to work efficiently is that its distribution must be specified.

Does the acceptance of a statistical hypothesis necessarily imply that it is true?

The truth or falsity of a statistical hypothesis is judged on the basis of the information contained in the sample. The rejection or acceptance of the hypothesis is contingent on the consistency or inconsistency of 𝐻0 with the sample observations. Therefore it should be clearly borne in mind that the acceptance of a statistical hypothesis is due to insufficient evidence provided by the sample to reject it, and it doesn’t necessarily imply that it is true.

Elements: Null Hypothesis, Alternative Hypothesis, Power Of Test

Null Hypothesis

A Null hypothesis is a hypothesis that says there is no statistical significance between the two variables in the hypothesis. There is no difference between certain characteristics of a population. It is denoted by the symbol 𝐻0. For example, the null hypothesis may be that the population mean is 40 then

𝐻0: 𝜇 = 40

Let us suppose that two different concerns manufacture drugs for inducing sleep, drug A manufactured by the first concern and drug B manufactured by the second concern. Each company claims that its drug is superior to that of the other, and it is desired to test which is the superior drug, A or B. To formulate the statistical hypothesis, let X be a random variable which denotes the additional hours of sleep gained by an individual when drug A is given, and let the random variable Y denote the additional hours of sleep gained when drug B is used. Let us suppose that X and Y follow probability distributions with means 𝜇𝑋 and 𝜇𝑌 respectively.

Here our null hypothesis would be that there is no difference between the effects of two drugs. Symbolically,

𝐻0: 𝜇𝑋 = 𝜇𝑌

Alternative Hypothesis

A statistical hypothesis which differs from the null hypothesis is called an Alternative Hypothesis, and is denoted by 𝐻1. The alternative hypothesis is not tested, but its acceptance (rejection) depends on the rejection (acceptance) of the null hypothesis. Alternative hypothesis contradicts the null hypothesis. The choice of an appropriate critical region depends on the type of alternative hypothesis, whether both-sided, one-sided (right/left) or specified alternative.

For example, in the drugs problem, the alternative hypothesis could be

𝐻1: 𝜇𝑋 ≠ 𝜇𝑌

Power Of Test

The null hypothesis 𝐻0: 𝜃 = 𝜃0 is accepted when the observed value of the test statistic lies outside the critical region, as determined by the test procedure. Suppose that the true value of 𝜃 is not 𝜃0 but another value 𝜃1, i.e. a specified alternative hypothesis 𝐻1: 𝜃 = 𝜃1 is true. A Type II error is committed if 𝐻0 is not rejected, i.e. the test statistic lies outside the critical region. Hence the probability of Type II error is a function of 𝜃1, because now 𝜃 = 𝜃1 is assumed to be true. If 𝛽(𝜃1) denotes the probability of Type II error when 𝜃 = 𝜃1 is true, the complementary probability 1 − 𝛽(𝜃1) is called the power of the test against the specified alternative 𝐻1: 𝜃 = 𝜃1.

Power = 1 − Probability of Type II error = Probability of rejecting 𝐻0 when 𝐻1 is true

Obviously, we would like a test to be as ‘powerful’ as possible among all critical regions of the same size. Treated as a function of 𝜃, the expression 𝑃(𝜃) = 1 − 𝛽(𝜃) is called the Power Function of the test for 𝜃0 against 𝜃. The curve obtained by plotting 𝑃(𝜃) against all possible values of 𝜃 is known as the Power Curve.
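To make the power function concrete, here is a minimal sketch for one particular case: a right-tailed z-test of a population mean with known standard deviation. The function name and the numbers used (null mean 40, s.d. 3, a hypothetical sample size of 9) are assumptions chosen for illustration only.

```python
from statistics import NormalDist

def power_right_tailed_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the right-tailed z-test of H0: mu = mu0 against the specified alternative mu = mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)      # reject H0 when z >= z_crit
    shift = (mu1 - mu0) / (sigma / n ** 0.5)    # mean of z when mu = mu1 is true
    beta = std_normal.cdf(z_crit - shift)       # P(z < z_crit | mu = mu1) = Type II error
    return 1 - beta

# A few points on the power curve (assumed: mu0 = 40, sigma = 3, n = 9).
for mu1 in (40, 41, 42, 43, 44):
    print(f"theta = {mu1}: power = {power_right_tailed_z(40, mu1, 3, 9):.3f}")
```

At 𝜃 = 𝜃0 the power equals the level of significance, and it rises towards 1 as the alternative moves further from 𝜃0, which is the shape a power curve is expected to have for this kind of test.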

Elements: Type I & Type II Error

Type I Error & Type II Error

The procedure of testing a statistical hypothesis does not guarantee that all decisions are perfectly accurate. At times, the test may lead to erroneous conclusions. This is so because the decision is taken on the basis of sample values, which are themselves fluctuating and depend purely on chance. The errors in statistical decisions are of two types:

1. Type I Error – This is the error committed by the test in rejecting a true null hypothesis.
2. Type II Error – This is the error committed by the test in accepting a false null hypothesis.

Consider the null hypothesis that the population mean is 40, i.e. 𝐻0: 𝜇 = 40, and let us imagine that we have a random sample from a population whose mean is really 40. If we apply the test for 𝐻0: 𝜇 = 40, we might find that the value of the test statistic lies in the critical region, thereby leading to the conclusion that the population mean is not 40; i.e. the test rejects the null hypothesis although it is true. We have thus committed what is known as “Type I error” or “Error of first kind”. On the other hand, suppose that we have a random sample from a population whose mean is known to be different from 40, say 43. If we apply the test for 𝐻0: 𝜇 = 40, the value of the statistic may, by chance, lie in the acceptance region, leading to the conclusion that the mean may be 40; i.e. the test does not reject the null hypothesis 𝐻0: 𝜇 = 40, although it is false. This is again an incorrect decision, and the error thus committed is known as “Type II error” or “Error of second kind”.

Using the sampling distribution of the test statistic, we can measure in advance the probabilities of committing the two types of error. Since the null hypothesis is rejected only when the test statistic falls in the critical region,

Probability of Type I error = Probability of rejecting 𝐻0: 𝜃 = 𝜃0, when it is true
= Probability that the test statistic lies in the critical region, assuming 𝜃 = 𝜃0.

The probability of Type I error must not exceed the level of significance (𝛼) of the test.

𝑃𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑜𝑓 𝑇𝑦𝑝𝑒 𝐼 𝑒𝑟𝑟𝑜𝑟 ≤ 𝐿𝑒𝑣𝑒𝑙 𝑜𝑓 𝑆𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒

The probability of Type II error assumes different values for different values of 𝜃 covered by the alternative hypothesis 𝐻1. Since the null hypothesis is accepted only when the observed value of the test statistic lies outside the critical region,

Probability of Type II error when 𝜃 = 𝜃1
= Probability of accepting 𝐻0: 𝜃 = 𝜃0, when it is false
= Probability that the test statistic lies in the region of acceptance, assuming 𝜃 = 𝜃1

The probability of Type I error is necessary for constructing a test of significance. It is in fact the ‘size of the Critical Region’. The probability of Type II error is used to measure the “power” of the test in detecting falsity of the null hypothesis. When the population has a continuous distribution,

Probability of Type I error
= Level of significance
= Size of critical region
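The two error probabilities can also be estimated by repeated sampling. The sketch below simulates a two-tailed z-test of a mean of 40 under assumed conditions (normal population, known s.d. of 3, a hypothetical sample size of 9, 4000 repetitions); the function name is hypothetical.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mu, mu0=40.0, sigma=3.0, n=9, alpha=0.05, trials=4000, seed=1):
    """Fraction of simulated samples for which the two-tailed z-test rejects H0: mu = mu0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        sample_mean = fmean(rng.gauss(true_mu, sigma) for _ in range(n))
        z = (sample_mean - mu0) / se
        if abs(z) >= z_crit:          # test statistic falls in the critical region
            rejections += 1
    return rejections / trials

type_1 = rejection_rate(true_mu=40.0)        # H0 true: every rejection is a Type I error
type_2 = 1 - rejection_rate(true_mu=43.0)    # H0 false: every non-rejection is a Type II error
print(f"estimated P(Type I error)  = {type_1:.3f}")
print(f"estimated P(Type II error) = {type_2:.3f} when the true mean is 43")
```

With a continuous population such as this one, the estimated Type I error rate should sit close to the chosen level of significance, while the Type II error rate depends on how far the true mean is from 40.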

Elements: Level Of Significance & Critical Region

Level Of Significance And Critical Region

The decision about rejection or otherwise of the null hypothesis is based on probability considerations. Assuming the null hypothesis to be true, we calculate the probability of obtaining a difference equal to or greater than the observed difference. If this probability is found to be small, say less than .05, the conclusion is that the observed value of the statistic is rather unusual, which suggests that the underlying assumption (i.e. the null hypothesis) is not true. We say that the observed difference is significant at 5 per cent level, and hence the ‘null hypothesis is rejected’ at 5 per cent level of significance. If, however, this probability is not very small, say more than .05, the observed difference cannot be considered to be unusual and is attributed to sampling fluctuation only. The difference is now said to be not significant at 5 per cent level, and we conclude that there is ‘no reason to reject the null hypothesis’ at 5 per cent level of significance. It has become customary to use 5% and 1% levels of significance, although other levels, such as 2%, may also be used.
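For a statistic that follows the standard normal distribution, the probability described above can be computed directly. This is a minimal sketch assuming a two-sided comparison; the function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) when Z is standard normal, i.e. assuming the null hypothesis is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def is_significant(z, alpha=0.05):
    """True when the observed difference is significant at level alpha."""
    return two_sided_p_value(z) < alpha

for z in (1.0, 1.96, 2.5):
    print(f"z = {z}: p-value = {two_sided_p_value(z):.4f}, significant at 5%: {is_significant(z)}")
```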

Without actually calculating this probability, the test of significance may be simplified as follows. From the sampling distribution of the statistic, we find the maximum difference which is exceeded in (say) 5 per cent of cases. If the observed difference is larger than this value, the null hypothesis is rejected. If it is less, there is no reason to reject the null hypothesis.

Suppose the sampling distribution of the statistic is a normal distribution. Since the area under the normal curve outside the ordinates at mean ± 1.96 (s.d.) is only 5%, the probability that the observed value of the statistic differs from the expected value by 1.96 times the S.E. or more is .05; and the probability of a larger difference will be still smaller. If, therefore,

𝑧 = (observed value − expected value) / S.E.

is either greater than 1.96 or less than −1.96 (i.e. numerically greater than 1.96), the null hypothesis 𝐻0 is rejected at 5% level of significance. The set of values 𝑧 ≥ 1.96 or 𝑧 ≤ −1.96, i.e.

|𝑧| ≥ 1.96

constitutes what is called the Critical Region for the test. Similarly, since the area outside mean ± 2.58 (s.d.) is only 1%, 𝐻0 is rejected at 1% level of significance if 𝑧 numerically exceeds 2.58, i.e. the critical region is |𝑧| ≥ 2.58 at 1% level. Using the sampling distribution of an appropriate test statistic, we are able to establish the maximum difference at a specified level between the observed and expected values that is consistent with the null hypothesis 𝐻0. The set of values of the test statistic corresponding to this difference, which lead to the acceptance of 𝐻0, is called the Region of Acceptance. Conversely, the set of values of the statistic leading to the rejection of 𝐻0 is referred to as the Region of Rejection or “Critical Region” of the test. The value of the statistic which lies at the boundary of the regions of acceptance and rejection is called the Critical value. When the null hypothesis is true, the probability of the observed value of the test statistic falling in the critical region is often called the “Size of Critical Region”.

𝑆𝑖𝑧𝑒 𝑜𝑓 𝐶𝑟𝑖𝑡𝑖𝑐𝑎𝑙 𝑅𝑒𝑔𝑖𝑜𝑛 ≤ 𝐿𝑒𝑣𝑒𝑙 𝑜𝑓 𝑆𝑖𝑔𝑛𝑖𝑓𝑖𝑐𝑎𝑛𝑐𝑒

However, for a continuous population, the critical region is so determined that its size equals the Level of Significance (𝛼).
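The critical values quoted for the normal case can be recovered from the standard normal distribution. The sketch below uses Python's `statistics.NormalDist`; the helper names are hypothetical.

```python
from statistics import NormalDist

def two_tailed_critical_value(alpha):
    """Value c such that the area outside (-c, c) under the standard normal curve is alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def size_of_critical_region(c):
    """P(|Z| >= c) for standard normal Z, i.e. the size of the region |z| >= c."""
    return 2 * (1 - NormalDist().cdf(c))

for alpha in (0.05, 0.01):
    c = two_tailed_critical_value(alpha)
    print(f"alpha = {alpha}: reject H0 when |z| >= {c:.3f} "
          f"(size of critical region = {size_of_critical_region(c):.3f})")
```

The 1% value comes out slightly below 2.58, which is the usual rounding of the same boundary.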

Two-Tailed And One-Tailed Tests

Our discussion above was centered around testing the significance of the ‘difference’ between the observed and expected values, i.e. whether the observed value differs from (i.e. is either larger or smaller than) the expected value by more than could arise due to fluctuations of random sampling. In the illustration, the null hypothesis is tested against “both-sided alternatives” 𝜇 > 40 or 𝜇 < 40, i.e.

𝐻0: 𝜇 = 40 against 𝐻1: 𝜇 ≠ 40

Thus assuming 𝐻0 to be true, we would be looking for large differences on both sides of the expected value, i.e. in “both tails” of the distribution. Such tests are, therefore, called “Two-tailed tests”.

Sometimes we are interested in tests for large differences on one side only, i.e. in one ‘tail’ of the distribution. For example, whether a change in the production technique yields bricks with a ‘higher’ breaking strength, or whether a change in the production technique yields a ‘lower’ percentage of defectives. These are known as “One-tailed tests”.

For testing the null hypothesis against “one-sided alternatives (right side)” 𝜇 > 40 , i.e.

𝐻0: 𝜇 = 40 against 𝐻1: 𝜇 > 40

The calculated value of the statistic 𝑧 is compared with 1.645, since 5% of the area under the standard normal curve lies to the right of 1.645. If the observed value of 𝑧 exceeds 1.645, the null hypothesis 𝐻0 is rejected at 5% level of significance. If a 1% level were used, we would replace 1.645 by 2.33. Thus the critical regions for the test at 5% and 1% levels are 𝑧 ≥ 1.645 and 𝑧 ≥ 2.33 respectively.

For testing the null hypothesis against “one-sided alternatives (left side)” 𝜇 < 40 i.e.

𝐻0: 𝜇 = 40 against 𝐻1: 𝜇 < 40

The value of z is compared with -1.645 for significance at 5% level, and with -2.33 for significance at 1% level. The critical regions are now 𝑧 ≤ −1.645 and 𝑧 ≤ −2.33 for 5% and 1% levels respectively. In fact, the sampling distributions of many of the commonly-used statistics can be approximated by normal distributions as the sample size increases, so that these rules are applicable in most cases when the sample size is ‘large’, say, more than 30. It is evident that the same null hypothesis may be tested against alternative hypothesis of different types depending on the nature of the problem. Correspondingly, the type of test and the critical region associated with each test will also be different.
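The three decision rules (two-tailed, right-tailed, left-tailed) can be written as one small function. This is a sketch for a statistic that is approximately standard normal; the function name and the `tail` labels are hypothetical.

```python
from statistics import NormalDist

def reject_null(z, alpha=0.05, tail="two"):
    """Decide whether z lies in the critical region for the chosen type of alternative."""
    std_normal = NormalDist()
    if tail == "two":      # H1: mu != mu0, alpha/2 in each tail
        return abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    if tail == "right":    # H1: mu > mu0, alpha in the right tail
        return z >= std_normal.inv_cdf(1 - alpha)
    if tail == "left":     # H1: mu < mu0, alpha in the left tail
        return z <= std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'right' or 'left'")

z = 1.8  # hypothetical observed value of the statistic
for tail in ("two", "right", "left"):
    print(f"z = {z}, {tail}-tailed test at 5%: reject H0 = {reject_null(z, 0.05, tail)}")
```

The same observed value can lead to different decisions depending on the alternative: 1.8 exceeds 1.645 but not 1.96, so it is significant for the right-sided alternative only.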

Solving Testing Of Hypothesis Problem

Step 1
Set up the “Null Hypothesis” 𝐻0 and the “Alternative Hypothesis” 𝐻1 on the basis of the given problem. The null hypothesis usually specifies the values of some parameters involved in the population: 𝐻0: 𝜃 = 𝜃0. The alternative hypothesis may be any one of the following types: 𝐻1: 𝜃 ≠ 𝜃0, 𝐻1: 𝜃 > 𝜃0, 𝐻1: 𝜃 < 𝜃0. The type of alternative hypothesis determines whether to use a two-tailed or one-tailed test (right or left tail).

Step 2

State the appropriate “test statistic” T and also its sampling distribution when the null hypothesis is true. In large sample tests the statistic 𝑧 = (𝑇 − 𝜃0) / S.E.(𝑇), which approximately follows the Standard Normal Distribution, is often used. In small sample tests, the population is assumed to be Normal and various test statistics are used which follow the Standard Normal, Chi-square, t or F distribution exactly.

Step 3
Select the “level of significance” 𝛼 of the test, if it is not specified in the given problem. This represents the maximum probability of committing a Type I error, i.e., of making a wrong decision by the test procedure when in fact the null hypothesis is true. Usually, a 5% or 1% level of significance is used (If nothing is mentioned, use 5% level).

Step 4

Find the “Critical region” of the test at the chosen level of significance. This represents the set of values of the test statistic which lead to rejection of the null hypothesis. The critical region always appears in one or both tails of the distribution, depending on whether the alternative hypothesis is one-sided or both-sided. The area in the tails must be equal to the level of significance 𝛼. For a one-tailed test, 𝛼 appears in one tail, and for a two-tailed test 𝛼/2 appears in each tail of the distribution. The critical region is

T ≥ T_{𝛼/2} or T ≤ T_{1−𝛼/2}, when 𝐻1: 𝜃 ≠ 𝜃0
T ≥ T_{𝛼}, when 𝐻1: 𝜃 > 𝜃0
T ≤ T_{1−𝛼}, when 𝐻1: 𝜃 < 𝜃0

where T_{𝛼} is the value of T such that the area to its right is 𝛼.

Step 5

Compute the value of the test statistic T on the basis of the sample data and the null hypothesis. In large sample tests, if some parameters remain unknown they should be estimated from the sample.
Step 6

If the computed value of test statistic T lies in the critical region, “reject 𝐻0”; otherwise “do not reject 𝐻0 ”. The decision regarding rejection or otherwise of 𝐻0 is made after a comparison of the computed value of T with critical value (i.e., boundary value of the appropriate critical region).

Step 7
Write the conclusion in plain non-technical language. If 𝐻0 is rejected, the interpretation is: “the data are not consistent with the assumption that the null hypothesis is true and hence 𝐻0 is not tenable”. If 𝐻0 is not rejected, “the data cannot provide any evidence against the null hypothesis and hence 𝐻0 may be accepted to be true”. The conclusion should preferably be given in the words stated in the problem.
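The seven steps can be traced in a short worked example. Everything numeric here is hypothetical: a sample of 36 observations with mean 41.2, drawn from a population with known s.d. 3, used to test 𝐻0: 𝜇 = 40 against a both-sided alternative. The function name is also an assumption.

```python
from statistics import NormalDist

def z_test_mean(sample_mean, n, mu0, sigma, alpha=0.05, alternative="two-sided"):
    """Large-sample z-test of H0: mu = mu0; returns (z, critical value, reject H0?)."""
    std_normal = NormalDist()
    se = sigma / n ** 0.5                      # Step 2: S.E. of the sample mean under H0
    z = (sample_mean - mu0) / se               # Step 5: computed value of the test statistic
    if alternative == "two-sided":             # Step 4: critical region depends on H1
        crit = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) >= crit
    elif alternative == "greater":
        crit = std_normal.inv_cdf(1 - alpha)
        reject = z >= crit
    elif alternative == "less":
        crit = std_normal.inv_cdf(alpha)
        reject = z <= crit
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, crit, reject                     # Step 6: decision

# Step 1: H0: mu = 40 against H1: mu != 40.  Step 3: alpha = 0.05 (hypothetical data below).
z, crit, reject = z_test_mean(sample_mean=41.2, n=36, mu0=40, sigma=3, alpha=0.05)
# Step 7: conclusion in plain language.
verdict = "not consistent with" if reject else "not inconsistent with"
print(f"z = {z:.2f}, critical value = {crit:.3f}: the data are {verdict} a mean of 40")
```

With these hypothetical figures the statistic is about 2.4, so 𝐻0 would be rejected at the 5% level but not at the 1% level, which shows why the level of significance has to be fixed before the decision is made.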

Conclusion

A hypothesis is a statistical statement or a conjecture about the value of a parameter. The legal concept that one is innocent until proven guilty has an analogous use in the world of statistics. In devising a test, statisticians do not attempt to prove that a particular statement or hypothesis is true. Instead, they assume that the null hypothesis holds (like a presumption of innocence), and then work to find statistical evidence that would allow them to overturn that assumption. In statistics this process is referred to as hypothesis testing, and it is often used to test the relationship between two variables. A hypothesis makes a prediction about some relationship of interest. Then, based on actual data and a pre-selected level of statistical significance, that hypothesis is either accepted or rejected. The elements of hypothesis testing include the null hypothesis, alternative hypothesis, Type I and Type II error, level of significance, critical region and power of the test, together with procedures such as one-tailed and two-tailed tests for locating the critical region; these, along with the error probabilities, help us reach the final conclusion.

A Null hypothesis is a hypothesis that says there is no statistical significance between the two variables in the hypothesis. There is no difference between certain characteristics of a population. It is denoted by the symbol 𝐻0. A statistical hypothesis which differs from the null hypothesis is called an Alternative Hypothesis, and is denoted by 𝐻1. The procedure of testing a statistical hypothesis does not guarantee that all decisions are perfectly accurate. At times, the test may lead to erroneous conclusions. This is so because the decision is taken on the basis of sample values, which are themselves fluctuating and depend purely on chance; the resulting wrong decisions are classified as Type I and Type II errors. Hypothesis testing is a very important part of statistical analysis. With the help of hypothesis testing, many business problems can be solved accurately.

That was the fourth part of the series, that explained hypothesis testing and hopefully it clarified your notion of the same by discussing each crucial aspect of it. You can find more informative posts like this one on Data Science course topics. Just keep on following the Dexlab Analytics blog to stay informed.


Classical Inferential Statistics: Theory of Sampling (Part -1)

1. Introduction:

Predictive models are developed over a specific time period and on a certain set of records. However, implementation happens on a mutually exclusive time period (Out of Time Sample). Therefore, the models developed need to be trained and validated on different datasets: 1. Model Development Data (training data) 2. In sample validation data 3. Out of time validation data. A predictive model is considered to be robust if its performance remains more or less stable in the out of time samples. An important observation from the description above is the following: the entire data (Population) is never accessible for model development and hence is unknown. Models are developed on subsets (Samples) which are representative of the entire data. Representativeness of the samples is important to ensure robustness in the model performance. This blog explores the key concepts related to creating representative samples from the population. Section 2 describes the basic components of the classical sampling theory, Section 3 describes the key types of sampling, Section 4 introduces the concept of Sampling Distribution and Section 5 concludes with the key summary of findings.
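As a rough illustration of the three datasets, the sketch below splits dated records into development, in-sample validation and out-of-time validation sets. The record layout (a date paired with a payload), the cutoff date and the 70/30 split are all assumptions made for the example, not a prescribed procedure.

```python
import random
from datetime import date, timedelta

def split_for_validation(records, cutoff, dev_fraction=0.7, seed=0):
    """Split (obs_date, payload) records into development, in-sample and out-of-time sets."""
    in_time = [r for r in records if r[0] < cutoff]        # same period as model development
    out_of_time = [r for r in records if r[0] >= cutoff]   # mutually exclusive later period
    rng = random.Random(seed)
    rng.shuffle(in_time)                                   # random assignment within the period
    k = round(len(in_time) * dev_fraction)
    return in_time[:k], in_time[k:], out_of_time

# Hypothetical data: one record per day for 100 days starting 1 January 2020.
records = [(date(2020, 1, 1) + timedelta(days=i), i) for i in range(100)]
development, in_sample, out_of_time = split_for_validation(records, cutoff=date(2020, 3, 1))
print(len(development), len(in_sample), len(out_of_time))
```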

• Introduction To Population and Sample

The two basic blocks of Classical Sampling Theory are: 1. Population 2. Samples. Population is defined as the base of all the observations which are eligible to be studied to address key questions relating to a statistical investigation or a business problem, irrespective of whether it can be accessed or not. In practice the entire population is always unknown, since there is a part of the population which cannot be accessed due to different reasons such as: Data Archiving Problems, Data permissions, Data Accessibility etc. A representative subset of the population is called a sample. The distribution of the variables in the sample is used to form an idea about the respective distribution of the variables in the population.

In practice, any predictive modelling exercise uses samples, since it cannot practically use the population. The population is not accessible because of the following reasons:

1. Observation Exclusions used in models: Observation Exclusions are used in predictive models to remove unnecessary observations, which are redundant for analysis. For example, when developing a credit risk model, observations which are bankrupts or frauds are removed from the analysis, since frauds and bankrupts are a part of operational risk.
2. Variable Exclusions used in the models: Variable Exclusions are used in predictive models to remove unnecessary variables which are redundant for analysis. For example, when developing a credit risk model, variables which are market-oriented variables or operational variables are excluded.
3. Robustness Check of the developed models: The developed models are validated on multiple samples, such as in-sample validation data and out-of-time validation samples. Therefore, only a fraction of the dataset is available for model development.

Hence, the population is always unknown, irrespective of the datasets, and so the key statistical distributions of the population are unknown as well.
• Mathematical Framework of Sampling Theory:

Let X be an N × k matrix (where N = total number of rows (observations) and k = total number of columns (variables)) which is normally distributed with mean μ and variance σ². The population mean μ and variance σ² are both unknown numerical features of the population distribution. These are called parameters: a parameter is a function of all the population observations.

The key objective of Classical Sampling Theory is to provide appropriate guidelines for analysing the population parameters based on the statistical moments of the sample. The statistical moments of the sample are called estimators. An estimator is a function of all the sample observations. For example, let us assume a subset of size ‘n’ is extracted from X such that the sample S is an n × k matrix which is normally distributed. Here x̄ and s² are the sample mean and the sample variance respectively. These descriptive moments are called statistics. A statistic is an estimator with a sampling distribution (detailed discussion: Section 4). The aim is to estimate the population parameters using the sample statistics, such that any difference between the two measures is statistically insignificant and can be considered an outcome of sampling fluctuations.
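The relationship between parameters and estimators can be illustrated with a minimal sketch. It assumes a single variable (k = 1) and a simulated population whose values (mean 40, standard deviation 8) are purely hypothetical; in real work the population figures would not be observable at all.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical population of one variable (k = 1); in practice it is unknown.
population = [rng.gauss(40, 8) for _ in range(20_000)]
mu = statistics.fmean(population)          # population mean (parameter)
sigma2 = statistics.pvariance(population)  # population variance (parameter)

# A simple random sample of size n drawn without replacement.
n = 500
sample = rng.sample(population, n)
x_bar = statistics.fmean(sample)           # sample mean (estimator of mu)
s2 = statistics.variance(sample)           # sample variance (estimator of sigma2)

print(f"population: mean={mu:.2f}, variance={sigma2:.2f}")
print(f"sample:     mean={x_bar:.2f}, variance={s2:.2f}")
```

The two rows will not match exactly; the gap between them is the sampling fluctuation that the theory tries to quantify.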

3. Types of Sampling

Broadly, there are two types of sampling methods discussed under the Classical Sampling theory: (i) Random Sampling (ii) Purposive Sampling. The different types of sampling and a brief description of each is provided in the figure below:

• Applications Of Sampling Methods:

In practical predictive modelling exercises, Stratified Random Sampling has wider appeal than Simple Random Sampling. Business datasets contain different categorical variables such as product type, branch size category, gender and income group. While splitting the total data into development data and validation data, it is important to ensure that the key categorical variables are represented in both samples. This ensures the representativeness of the sample and the robustness of the model. In such cases Stratified Random Sampling is preferred to Simple Random Sampling. The use of Simple Random Sampling is limited to cases where the data is symmetric and little heterogeneity is observed in the distribution of the values of the variables. The following examples discuss applications of the classical sampling methods:

Example 01: Splitting the Model Development Data into Training and Validation Datasets

Models, once developed, need to be validated. The standard practice is to divide the data in a 70% – 30% proportion. The models are trained on 70% of the observations and validated using the remaining 30%. To ensure the robustness of the model, the distribution of the target variable should be similar in both the development and validation datasets. Therefore, the target variable is used as the strata variable.
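The 70% – 30% split with the target as the strata variable can be sketched as follows. This is only an illustration using the Python standard library: the function name stratified_split, the account records and the 10% event rate are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, strata_key, train_frac=0.7, seed=42):
    """Split rows into development and validation sets, stratum by stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[strata_key(row)].append(row)

    train, valid = [], []
    for members in strata.values():
        members = members[:]              # copy so the caller's list is not reordered
        rng.shuffle(members)              # random order within the stratum
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return train, valid


def event_rate(rows):
    """Share of rows with target = 1."""
    return sum(r["target"] for r in rows) / len(rows)


# Hypothetical data: 1,000 accounts, every tenth one is an event (target = 1).
accounts = [{"id": i, "target": 1 if i % 10 == 0 else 0} for i in range(1000)]
train, valid = stratified_split(accounts, strata_key=lambda r: r["target"])

print(len(train), len(valid))
print(event_rate(train), event_rate(valid))
```

Because each stratum is split separately, the event rate is the same in both datasets, which a simple random split would only achieve approximately.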

Example 02: Bootstrapping Analysis

Bootstrapping exercises rely on Simple Random Sampling with Replacement. Bootstrapping is a nonparametric resampling method used to assign measures of accuracy to sample estimates.
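A minimal sketch of the idea, assuming the measure of accuracy of interest is the standard error of the sample mean. The helper name bootstrap_se, the simulated sample and the number of resamples are illustrative choices, not part of the original discussion.

```python
import random
import statistics


def bootstrap_se(sample, statistic, n_resamples=2000, seed=0):
    """Estimate the standard error of a statistic by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)   # simple random sampling with replacement
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)


# Hypothetical sample of 200 observations.
rng = random.Random(1)
sample = [rng.gauss(50, 10) for _ in range(200)]

se_boot = bootstrap_se(sample, statistics.fmean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5  # textbook s / sqrt(n)

print(f"bootstrap SE: {se_boot:.3f}")
print(f"formula SE:   {se_formula:.3f}")
```

For the mean, the bootstrap figure should land close to the textbook formula; the method earns its keep for statistics such as the median, where no simple formula is available.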

4. Sampling Distribution: Overview

The sampling distribution of a statistic may be defined as the probability law which the statistic follows if repeated random samples of a fixed size are drawn from a specified population. Suppose a number of samples, each of size n, are taken from the same population and the value of the statistic is calculated for each sample; a series of values of the statistic is then obtained. If the number of samples is large, these values may be arranged into a frequency table. The frequency distribution of the statistic that would be obtained if the number of samples, each of the same size (say n), were infinite is called the sampling distribution of the statistic. The table below shows a Sample Distribution and its associated frequency distribution:
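The repeated-sampling definition can be approximated by simulation. The sketch below assumes a hypothetical normal population and uses the sample mean as the statistic; the sample size, number of samples and bin width are arbitrary illustrative choices.

```python
import random
import statistics
from collections import Counter

rng = random.Random(7)

# Hypothetical population; in practice it would not be fully observable.
population = [rng.gauss(100, 15) for _ in range(50_000)]

n = 25             # fixed sample size
n_samples = 5000   # a large, but finite, number of repeated samples

# Value of the statistic (here the mean) for each repeated sample.
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]

# Arrange the values into a frequency table with bins of width 2.
bins = Counter(int(m // 2) * 2 for m in sample_means)
for lower in sorted(bins):
    print(f"[{lower}, {lower + 2}): {bins[lower]}")

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = statistics.pstdev(population) / n ** 0.5

print(f"mean of sample means: {mean_of_means:.2f}")
print(f"sd of sample means:   {sd_of_means:.2f} (sigma / sqrt(n) = {theoretical_se:.2f})")
```

The frequency table is a finite approximation of the sampling distribution: its centre sits near the population mean and its spread near σ/√n.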

5. Conclusion:

The blog, brought to you by DexLab Analytics, a premier institute conducting statistical analytics courses in Gurgaon and business analysis training in Delhi, introduces the basic concepts of the Classical Sampling Framework. The objective here has been to explore the broad tenets of sampling theory, such as the different methods of sampling, their uses and their respective advantages and disadvantages. Stratified Random Sampling is a more versatile sampling method than Simple Random Sampling. The concept of the sampling distribution has been introduced but not discussed in detail. This is to be the subject matter of the next blog: Sampling Distributions and their importance in Sampling Theory.


Citizen Data Scientists: Who Are They & What Makes Them Special?

Companies across the globe are focusing their attention on data science to unlock the potential of their data. What remains crucial, however, is finding well-trained data scientists to build such advanced systems.

Today, many organizations are seeking citizen data scientists. The notion isn’t new, but the practice is steadily picking up pace across industries, thanks to a number of factors, including continual improvement in the quality of tools and the difficulty of finding properly skilled data scientists.

Gartner, a top-notch analyst firm, has been promoting this concept for the past few years. In 2014, the firm predicted that the total number of citizen data scientists would grow 5X faster than that of regular data scientists through 2017. We are not sure whether the forecast panned out, but what we do know is that the growth of citizen data scientists has exceeded our expectations.

Recently, Gartner analyst Carlie Idoine explained that a citizen data scientist is one who “creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics.” They are also termed “power users”, who are able to perform cutting-edge analytical tasks that require added expertise. “They do not replace the experts, as they do not have the specific, advanced data science expertise to do so. But they certainly bring their OWN expertise and unique skills to the process,” she added.

Of late, citizen data scientists have become critical assets to organizations. They help businesses discover key big data insights and, in the process, are asked to derive answers from data that is not available in regular relational databases and therefore cannot be queried through SQL. As a result, citizen data scientists often turn to machine learning models that generate predictions from a large number of data types. SQL remains effective, but Python statistical libraries and Jupyter notebooks take you further.

A majority of industries use SQL; it has been data’s lingua franca for years. Knowing how to write a SQL query to draw answers out of relational databases remains a crucial element of a company’s data management system, since a great deal of business data is stored in relational databases. Nevertheless, advanced machine learning tools are steadily gaining importance and acceptance.

A wide array of job titles for citizen data scientists exists in the real world, and some of them are variations of the business analyst job profile. Depending on an organization’s requirements, the need for experienced analysts and data scientists varies.

Looking for a good analytics training institute in Delhi? Visit DexLab Analytics.

DataRobot, a developer of a proprietary data science and machine learning automation platform, has recently been helping citizen data scientists through the power of automation. “There’s a lot happening behind the scenes that folks don’t realize necessarily is happening,” said Jen Underwood, a BI veteran and DataRobot’s recently hired director of product marketing. “When I was doing data science, I would run one algorithm at a time. ‘Ok let’s wait until it ends, see how it does, and try another, one at a time.’ [With DataRobot] a lot of the steps I was taking are now automated, in addition to running the algorithms concurrently and ranking them.”

Big data analytics is progressing, and capabilities that were once restricted to certain groups of professionals are now accessible to a wider pool of interested parties. So, if you are interested in this blooming field of opportunities, do take a look at our business analyst training courses in Gurgaon. They will help you chart a successful analyst career.

The blog has been sourced from datanami.com/2018/08/13/empowering-citizen-data-science

#TimeToReboot: 10 Random, Fun Facts You Must Know About IT Industry

The Indian IT sector is expected to grow at a modest rate this fiscal year, which started in April. Companies are expanding their scope and building new capabilities or enhancing older ones, and demand for digital services is rising sharply. The good news is that the digital component of the industry is flourishing faster than expected. It’s forming a bigger part of a tech-driven future, and we’re all excited!

On that positive note, here we’ve gathered a few fun facts about the IT industry that are bound to intrigue your data-hungry heart and mind… Hope you’ll enjoy the read as much as I did while scampering through research materials to compile this post!

Let’s get started…

1 out of 8 marriages in the US happened between couples who’ve met online. Wicked?

Excited to know all this? Now, on a serious note: in the days to come, the Indian IT industry is set to transform itself with high-velocity tools and technology, and if you want to play a significant role in this digital transformation, arm yourself with a solid data-friendly skill or tool.

The deal turns sweeter if you come from a computer science background or have a knack for playing with numbers. If so, we have high-end business analyst training courses in Gurgaon to suit your purpose and career aspirations. Drop by DexLab Analytics: as a top-of-the-line analytics training institute, it brings you a smart blend of knowledge, aptitude and expertise in the form of a student-friendly curriculum. For more details, visit their site today.

The blog has been sourced from:

qarea.com/blog/facts-from-the-it-industry

Transforming Society with Blockchain and Its Potential Applications Worldwide

According to Google Search, ‘blockchain’ is defined as “a digital ledger in which transactions made in bitcoin or in other cryptocurrency is recorded chronologically and publicly.”

In cryptocurrency terms, a block is a record of new transactions, which could indicate, for example, where cryptocurrency is held. Once a block is complete, it is added to the chain, creating a chain of blocks known as a blockchain.

Suppose a Google spreadsheet is shared by each and every computer which is connected to the internet in this world. When a transaction happens, it will be recorded in a row of this spreadsheet. Just like a spreadsheet has rows, Blockchain consists of Blocks for each transaction.

Anyone with a computer or mobile device can connect to the internet, access the spreadsheet and add a transaction, but the spreadsheet doesn’t permit anyone to edit information that is already recorded. No third party can interfere with its transactions, which saves time and avoids conflict.
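The "add but never edit" property of the shared spreadsheet can be sketched with a hash-linked list of blocks. This is a single-machine toy written for illustration only: the Ledger class and its methods are hypothetical, and a real blockchain adds networking, consensus and digital signatures on top of this idea.

```python
import hashlib
import json


def block_hash(index, transactions, prev_hash):
    """SHA-256 fingerprint of a block's contents and its link to the previous block."""
    payload = json.dumps(
        {"index": index, "transactions": transactions, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Ledger:
    """Append-only chain of blocks (illustrative, not a real blockchain client)."""

    def __init__(self):
        self.blocks = []

    def add_block(self, transactions):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        index = len(self.blocks)
        self.blocks.append({
            "index": index,
            "transactions": transactions,
            "prev_hash": prev_hash,
            "hash": block_hash(index, transactions, prev_hash),
        })

    def is_valid(self):
        """Recompute every hash; any edit to an earlier block breaks the chain."""
        prev_hash = "0" * 64
        for block in self.blocks:
            expected = block_hash(block["index"], block["transactions"], block["prev_hash"])
            if block["prev_hash"] != prev_hash or block["hash"] != expected:
                return False
            prev_hash = block["hash"]
        return True


ledger = Ledger()
ledger.add_block([{"from": "alice", "to": "bob", "amount": 5}])
ledger.add_block([{"from": "bob", "to": "carol", "amount": 2}])
valid_before = ledger.is_valid()

# Try to edit a row that is already in the "spreadsheet".
ledger.blocks[0]["transactions"][0]["amount"] = 500
valid_after = ledger.is_valid()

print(valid_before, valid_after)
```

New rows can always be appended, but changing an existing one alters its hash and is detected the next time the chain is checked.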

Types of Blockchains:

• Public and permissionless: Public and permissionless blockchains resemble bitcoin, the first blockchain. All transactions in these blockchains are public and no permission is required to join these distributed networks.
• Private and permissioned: These blockchains are restricted to designated members, transactions are private, and permission from an owner or managing entity is required to join the network. These are often used by private consortia to manage industry value chain opportunities.
• Hybrid blockchains: An additional area is the emerging concept of the sidechain, which allows different blockchains (public or private) to communicate with each other, enabling transactions between participants across blockchain networks.

Various Applications Of Blockchain Are As Follows:

a) Smart Contracts:

Smart contracts ease the way we exchange money, property and shares, and avoid third-party conflicts. Access to the keys is granted only to the authorized party. Basically, computers are given the command to control the contracts and to release or hold the funds by giving the keys to the permitted persons.

For example, if I want to rent an office space from you, we can do this in blockchain using cryptocurrency.  You will get a receipt which is saved in the virtual contract and I will get the digital entry key which will reach me by a specified date. If you send the key before the specified date, the function holds it and releases both receipt and the key when the date arrives.

If I receive the key, I must pay you. The contract is cancelled when the agreed time expires, and it cannot be tampered with, since all the participants would be alerted. Smart contracts can be used for insurance premiums, financial derivatives, financial services, legal processes, etc.

b) Digital Identity:

The future of blockchain will be blooming in the coming years. Blockchain technologies make both managing and tracking digital identities reliable and systematic, resulting in easy registering and minimizing fraud.

Be it national security, citizenship documentation, banking, online retailing or healthcare, identity authentication and authorization is a process entangled in between commerce and culture, worldwide.  Introducing blockchain into identity-based mechanisms can really bring captivating solutions to the security problems we have online.

Blockchain technology is known to offer a solution to many digital identity issues, where identity can be uniquely validated in an undeniable, unchangeable, and secured manner.

Present-day methods involve problematic password-based systems of known secrets which are exchanged and stored on insecure computer systems. Blockchain-based authentication systems are instead built on irrefutable identity verification using digital signatures based on public-key cryptography.

In blockchain identity confirmation, the only check performed is whether the transaction was signed by the authorized private key. Whoever has access to the private key is taken to be the owner, and the exact identity of the owner is deemed irrelevant.

c) Insurance:

Claims processing can be frustrating and thankless. Insurance processors have to wade through fraudulent claims, abandoned policies and fragmented data sources for customers, to name a few, and process these documents manually. The room for error is enormous. The blockchain provides an ideal framework for risk-free management and transparency. Its encryption properties enable insurers to record the ownership of the assets to be insured.

“This will be the toughest on the portions of the industry that are least differentiated, where consumers often decide based on price: auto, life, and homeowner’s insurance.” — Harvard Business Review

d) Supply-Chain Communications and Proof-of-Provenance:

Most of the things we purchase aren’t made by a single organization, but by a chain of suppliers who sell their components (e.g., graphite for pencils) to a company that assembles and markets the final product. If any of those components fails, however, the brand takes the brunt of the backlash; it holds most of the responsibility for its supply chain.

But what if a company could proactively provide digitally permanent, auditable records that show stakeholders the state of the product at each value-adding step?

This is no small task: the worldwide supply chain is estimated to be worth \$40 trillion, and from a business-process point of view it is remarkably inefficient and messy. On a related note, blockchain can be used to track diamonds, artwork, real estate, and practically any other asset.

e) Music Industry:

While music lovers have hailed digitization as the democratization of the music business, the 15.7 billion dollar music industry remains, puzzlingly, much as it was. Music piracy through illegally downloaded, copied and shared content eats into artists’ royalties and music labels’ revenue. Added to this is the absence of a robust rights management system, which leads to loss of revenue for the artist.

Also, the revenue, when it finally reaches the artist, can take up to two years! Another area of concern is unpaid royalties, which are often held up at various stages because of missing data or rights ownership information. There is also a lack of access to real-time digital sales data, which, if available, could be used to plan marketing campaigns more effectively.

These are the very areas where blockchain can have a striking impact. As a publicly accessible and decentralized database that is distributed over the web, blockchain maintains permanent and undeletable records in cryptographic format. Transactions happen over a peer-to-peer network and are computed, verified and recorded using an automated consensus method, eliminating the need for an intermediary or third party to manage or control data.

The very architecture of blockchain, being immutable, distributed and peer-to-peer, carries enormous potential to address the present problems affecting the music business.

An important area in which blockchain can bring about positive change is the creation of a digital rights database. Digital rights expression is one of the critical issues troubling the present music industry. Identifying the copyright of a song and defining how royalties should be split among songwriters, performers, publishers and producers is difficult in the digital space. Artists regularly lose out on royalties because of a complicated copyright environment.

Blockchain’s immutable distributed ledger system, which ensures that no single organization can claim ownership, offers an ideal solution. Secure files with all relevant information, such as composition, verses, liner notes, cover art, licensing, and so on, can be encoded onto the blockchain, creating a permanent and inerasable record.

f) Government and Public records:

The administration of public services is yet another area where blockchain can help reduce paper-based processes, limit fraud, and increase accountability between authorities and those they serve.

Some US states are taking the initiative to realize the benefits of blockchain: the Delaware Blockchain Initiative, launched in 2016, aims to create an appropriate legal infrastructure for distributed ledger shares to increase the efficiency and speed of incorporation services.

Illinois, Vermont, and other states have since announced similar initiatives. Startup companies are joining the effort as well: in Eastern Europe, the BitFury Group is presently working with the Georgian government to secure and track government records.

Conclusion:

This article has focused on blockchain and its applications in various industries, explaining the challenges and the potential, and how people can secure their information digitally. These applications are still under development and their full potential is yet to be worked out, but blockchain could become a powerful tool for conducting fair trade, improving business and supporting society.

To never miss a beat of technology related news and feeds – follow DexLab Analytics. We are a team of experts offering state of the art business analyst training courses in Gurgaon. Not only that, we provide a plethora of machine learning and Hadoop courses too for all the data-hungry candidates. So, drop by and quench your thirst for data from us!

K. Maneesha is an SEO Developer at Mindmajix.com. She holds a master’s degree in Marketing from Alliance University, Bangalore. Maneesha is a dog-lover and enjoys traveling with friends on trips. You can reach her at manisha.m4353@gmail.com. Her LinkedIn profile: Maneesha Kakulapati.

Business Intelligence Software is the Key for an Organization to Gain Competitive Advantage

Business Intelligence, or BI, is crucial for organizations as strategic planning is heavily dependent on BI. BI tools are multi-purpose and used for indicating progress towards business goals, quantitatively analyzing data, distribution of data and developing customer insights.

Advanced computer technologies are applied in Business Intelligence to discover relevant business data and then analyze it. It not only spots current trends in data, but is also able to develop historical views and future predictions. This helps decision-makers to comprehend business information properly and develop strategies that will steer their organization forward.

BI tools transform raw business information into valuable data that increase revenue for organizations. The global business economy is completely data driven. Companies without BI software will be jeopardizing their success. It is time to shed the belief that BI software is superfluous. Rather, it is a necessity.

Nowadays, there isn’t much time to ponder over data sheets and then come to a conclusion. Decisions have to be taken on the spot. Valuable information doesn’t include business data alone, but also what the data implies for your business. BI gives you a competitive lead as it provides valuable information with the push of a button.

Business Intelligence software provides KPIs (Key Performance Indicators), which are metrics aligned with your business strategies. Thus, businesses can make decisions based on solid facts instead of intuition. This makes business proceedings more efficient.

1. Employees have data-power

BI solutions help employees to make informed decisions backed by relevant data. Access to information across all levels ensures company-wide integration of data. This helps employees nurture their skills. A competitive workforce will help a company gain global recognition.

Business intelligence is able to determine where and how potential customers consume data, how to convert them to paying customers, and chalk out an appropriate plan that will help increase revenue for your business.

2. Avoid blockages in markets

There are many BI applications that can be incorporated with accounting software. Business intelligence provides information about the real health of an organization, which cannot be determined from a profit and loss sheet. BI includes predictive features that help avoid blockages in markets and determine the right time for important decisions, like hiring new employees. Easy-to-understand dashboards enable decision-makers to stay informed.

3. Create an efficient business model

As explained by Jeremy Levi, Director of Marketing, MarsWellness.com, “Why is BI more important than ever? In one word: oversaturation. The internet and the continued growth of e-commerce have saturated every market…For business owners, this means making smart decisions and trying to know where to put your marketing dollars and where to invest in infrastructure. Business intelligence lets you do that, and without it, you’re simply fumbling around for the light switch in the dark.”

4. Improved customer insights

In the absence of BI tools, one can spend hours trying to make sense out of previous reports without coming to a satisfactory conclusion. It is crucial for businesses to meet customer demands. BI tools help map patterns in customer behavior so businesses can prioritize loyal customers and improve customer satisfaction.

5. Helps save money

BI tools help spot areas in your business where costs can be minimized. For example, suppose there is unnecessary spending occurring in the supply chain. BI can identify whether it is inefficient acquisition or maintenance that is translating to increased costs. Thus, it enables businesses to take the necessary actions to cut costs.

6. Improve efficiency of workers

Business intelligence solutions can monitor the output of members and functioning of teams. These help improve efficiency of the workers and streamline the business processes.

7. Protects businesses from cyber threats

Cyber crimes like data breaches and malware attacks are very common. Cyber security has become the need of the hour. Businesses should invest in BI solutions equipped with security tools that help protect their valuable data from hackers and other cyber attacks.

Businesses will progress rapidly through the use of smart BI solutions. Organizations, small or big, can use BI tools in a variety of areas, from budgeting to building relationships with customers.

If you want to empower your business through BI then enroll yourself for the Tableau BI certification course at DexLab Analytics, Delhi. DexLab is a premium institute providing business analysis training in Delhi.

How Credit Unions Can Capitalize on Data through Enterprise Integration of Data Analytics

To get valuable insights from the enormous quantity of data generated, credit unions need to move towards enterprise integration of data. This is a company-wide data democratization process that helps all departments within the credit union to manage and analyze their data. It gives each team member easy access to relevant data and the means to use it properly.

However, awareness of the advantages of enterprise-wide data analytics isn’t sufficient for credit unions to deploy this system. Here is a three-step guide to help credit unions get smarter in data handling.

Improve the quality of data

A robust and functional customer data set is of foremost importance. Unorganized data will hinder forming correct opinions about customer behavior. The following steps will ensure that relevant data enters the business analytics tools.

• Integrate the various analytics activities: Instead of operating separate analytics software for digital marketing, credit risk analytics, fraud detection and other financial activities, it is better to have a centralized system which integrates these activities. This helps build cross-operational awareness.
• Choose experienced analytics vendors: Vendors with experience can access a wide range of data. Hence, they can deliver information that is more valuable. They also provide pre-existing integrations.
• Consider unconventional sources of data: Unstructured data from unconventional sources like social media and third parties should be valued, as it will prove useful in the future.
• Cleanse data continuously, in a way that evolves with time: Clean data is essential for producing correct results. The data should be organized, error-free and consistently formatted.

Data structure customized for credit unions

The business analytics tools for credit unions should perform the following analyses:

• Analyze the growth and decline in customer numbers by age, location, branch, products used, etc.
• Measure profit through the count of balances
• Analyze the performance of staff and members in a particular department or branch
• Report sales ratios
• Analyze the age distribution of account holders in a particular geographic location
• Perform trend analysis as and when required
• Analyze satisfaction levels of members
• Keep track of the transactions performed by members
• Track the inquiries made at call centers and online banking portals
• Analyze the behavior of self-serve vs. non-self-serve users based on different demographics
• Determine the different types of accounts being opened and figure out the source responsible for the highest number of transactions

User-friendly interfaces for manipulating data

Important decisions about growing revenue, mitigating risks and improving customer experience should be based on insights drawn using analytics tools. Hence, accessing the data should be a simple process. The following user-interface features help make data user-friendly.

Dashboards- Dashboards make data comprehensible even for non-technical users by presenting it visually. They provide an at-a-glance view of the key metrics, like lead generation rates and profitability sliced by demographics. Different datasets can be viewed in one place.

Scorecards- A scorecard is a type of report that compares a person’s performance against their goals. It measures success based on Key Performance Indicators (KPIs) and aids in keeping members accountable.

Automated reports- Primary stakeholders should receive automated reports by email on a daily basis so that they have access to all the relevant information.

Data analytics should encompass all departments of a credit union. This will help draw better insights and improve KPI tracking. Thus, the overall performance of the credit union will become better and more efficient with time.

Technologies that help organizations draw valuable insights from their data are becoming very popular. To know more about these technologies, follow DexLab Analytics, a premier institute providing business analyst training courses in Gurgaon, and do take a look at their credit risk modeling training course.