In our previous blog we discussed about few of the basic functions of MQL like .find() , .count() , .pretty() etc. and in this blog we will continue to do the same. At the end of the blog there is a quiz for you to solve, feel free to test your knowledge and wisdom you have gained so far.

Given below is the list of functions that can be used for data wrangling:-

updateOne() :- This function is used to change the current value of a field in a single document.

After changing the database to “sample_geospatial” we want to see what the document looks like? So for that we will use .findOne() function.

Now lets update the field value of “recrd” from ‘ ’ to “abc” where the “feature_type” is ‘Wrecks-Visible’.

Now within the .updateOne() funtion any thing in the first part of { } is the condition on the basis of which we want to update the given document and the second part is the changes which we want to make. Here we are saying that set the value as “abc” in the “recrd” field . In case you wanted to increase the value by a certain number ( assuming that the value is integer or float) you can use “$inc” instead.

2. updateMany() :- This function updates many documents at once based on the condition provided.

3. deleteOne() & deleteMany() :- These functions are used to delete one or many documents based on the given condition or field.

4. Logical Operators :-

“$and” : It is used to match all the conditions.

“$or” : It is used to match any of the conditions.

The first code matches both the conditions i.e. name should be “Wetpaint” and “category_code” should be “web”, whereas the second code matches any one of the conditions i.e. either name should be “Wetpaint” or “Facebook”. Try these codes and see the difference by yourself.

So, with that we come to the end of the discussion on the MongoDB Basics. Hopefully it helped you understand the topic, for more information you can also watch the video tutorial attached down this blog. The blog is designed and prepared by Niharika Rai, Analytics Consultant, DexLab AnalyticsDexLab Analytics offers machine learning courses in Gurgaon. To keep on learning more, follow DexLab Analytics blog.

MongoDB is a document based database program which was developed by MongoDB Inc. and is licensed under server side public license (SSPL). It can be used across platforms and is a non-relational database also known as NoSQL, where NoSQL means that the data is not stored in the conventional tabular format and is used for unstructured data as compared to SQL and that is the major difference between NoSQL and SQL. MongoDB stores document in JSON or BSON format. JSON also known as JavaScript Object notation is a format where data is stored in a key value pair or array format which is readable for a normal human being whereas BSON is nothing but the JSON file encoded in the binary format which is quite hard for a human being to understand. Structure of MongoDB which uses a query language MQL(Mongodb query language):- Databases:- Databases is a group of collections. Collections:- Collection is a group fields. Fields:- Fields are nothing but key value pairs Just for an example look at the image given below:-

Here I am using MongoDB Compass a tool to connect to Atlas which is a cloud based platform which can help us write our queries and start performing all sort of data extraction and deployment techniques. You can download MongoDB Compass via the given link https://www.mongodb.com/try/download/compass

In the above image in the red box we have our databases and if we click on the “sample_training” database we will see a list of collections similar to the tables in sql.

Now lets write our first query and see what data in “companies” collection looks like but before that select the “companies” collection.

Now in our filter cell we can write the following query:-

In the above query “name” and “category_code” are the key values also known as fields and “Wetpaint” and “web” are the pair values on the basis of which we want to filter the data. What is cluster and how to create it on Atlas? MongoDB cluster also know as sharded cluster is created where each collection is divided into shards (small portions of the original data) which is a replica set of the original collection. In case you want to use Atlas there is an unpaid version available with approximately 512 mb space which is free to use. There is a pre-existing cluster in MongoDB named Sandbox , which currently I am using and you can use it too by following the given steps:- 1. Create a free account or sign in using your Google account on https://www.mongodb.com/cloud/atlas/lp/try2-in?utm_source=google&utm_campaign=gs_apac_india_search_brand_atlas_desktop&utm_term=mongodb%20atlas&utm_medium=cpc_paid_search&utm_ad=e&utm_ad_campaign_id=6501677905&gclid=CjwKCAiAr6-ABhAfEiwADO4sfaMDS6YRyBKaciG97RoCgBimOEq9jU2E5N4Jc4ErkuJXYcVpPd47-xoCkL8QAvD_BwE 2. Click on “Create an Organization”. 3. Write the organization name “MDBU”. 4. Click on “Create Organization”. 5. Click on “New Project”. 6. Name your project M001 and click “Next”. 7. Click on “Build a Cluster”. 8. Click on “Create a Cluster” an option under which free is written. 9. Click on the region closest to you and at the bottom change the name of the cluster to “Sandbox”. 10. Now click on connect and click on “Allow access from anywhere”. 11. Create a Database User and then click on “Create Database User”. username: m001-student password: m001-mongodb-basics 12. Click on “Close” and now load your sample as given below :

Loading may take a while…. 13. Click on collections once the sample is loaded and now you can start using the filter option in a similar way as in MongoDB Compass In my next blog I’ll be sharing with you how to connect Atlas with MongoDB Compass and we will also learn few ways in which we can write query using MQL.

So, with that we come to the end of the discussion on the MongoDB. Hopefully it helped you understand the topic, for more information you can also watch the video tutorial attached down this blog. The blog is designed and prepared by Niharika Rai, Analytics Consultant, DexLab AnalyticsDexLab Analytics offers machine learning courses in Gurgaon. To keep on learning more, follow DexLab Analytics blog.

In this particular blog we will discuss about few of the basic functions of MQL (MongoDB Query Language) and we will also see how to use them? We will be using MongoDB Compass shell (MongoSH Beta) which is available in the latest version of MongoDB Compass.

Connect your Atlas cluster to your MongoDB Compass to get started. Latest version of MongoDB Compass will have this shell, so if you don’t find this shell then please install the latest version for this to work.

Now lets start with the functions.

find() :- You need this function for data extraction in the shell.

In the shell we need to first write the “use database name” code to access the database then use .find() to extract data which has name “Wetpaint”

For the above query we get the following result:-

The above result brings us to another function .pretty() .

2. pretty() :- this function helps us see the result more clearly.

Try it yourself to compare the results.

3. count() :- Now lets see how many entries we have by the company name “Wetpaint”.

So we have only one document.

4. Comparison operators :-

“$eq” : Equal to

“$neq”: Not equal to

“$gt”: Greater than

“$gte”: Greater than equal to

“$lt”: Less than

“$lte”: Less than equal to

Lets see how this works.

5. findOne() :- To get a single document from a collection we use this function.

6. insert() :- This is used to insert documents in a collection.

Now lets check if we have been able to insert this document or not.

Notice that a unique id has been added to the document by default. The given id has to be unique or else there will be an error. To provide a user defined id use “_id”.

So, with that we come to the end of the discussion on the MongoDB. Hopefully it helped you understand the topic, for more information you can also watch the video tutorial attached down this blog. The blog is designed and prepared by Niharika Rai, Analytics Consultant, DexLab AnalyticsDexLab Analytics offers machine learning courses in Gurgaon. To keep on learning more, follow DexLab Analytics blog.

This is another blog added to the series of time series forecasting. In this particular blog I will be discussing about the basic concepts of ARIMA model.

So what is ARIMA?

ARIMA also known as Autoregressive Integrated Moving Average is a time series forecasting model that helps us predict the future values on the basis of the past values. This model predicts the future values on the basis of the data’s own lags and its lagged errors.

When a data does not reflect any seasonal changes and plus it does not have a pattern of random white noise or residual then an ARIMA model can be used for forecasting.

There are three parameters attributed to an ARIMA model p, q and d :-

p :- corresponds to the autoregressive part

q:- corresponds to the moving average part.

d:- corresponds to number of differencing required to make the data stationary.

In our previous blog we have already discussed in detail what is p and q but what we haven’t discussed is what is d and what is the meaning of differencing (a term missing in ARMA model).

Since AR is a linear regression model and works best when the independent variables are not correlated, differencing can be used to make the model stationary which is subtracting the previous value from the current value so that the prediction of any further values can be stabilized . In case the model is already stationary the value of d=0. Therefore “differencing is the minimum number of deductions required to make the model stationary”. The order of d depends on exactly when your model becomes stationary i.e. in case the autocorrelation is positive over 10 lags then we can do further differencing otherwise in case autocorrelation is very negative at the first lag then we have an over-differenced series.

The formula for the ARIMA model would be:-

To check if ARIMA model is suited for our dataset i.e. to check the stationary of the data we will apply Dickey Fuller test and depending on the results we will using differencing.

In my next blog I will be discussing about how to perform time series forecasting using ARIMA model manually and what is Dickey Fuller test and how to apply that, so just keep on following us for more.

So, with that we come to the end of the discussion on the ARIMA Model. Hopefully it helped you understand the topic, for more information you can also watch the video tutorial attached down this blog. The blog is designed and prepared by Niharika Rai, Analytics Consultant, DexLab AnalyticsDexLab Analytics offers machine learning courses in Gurgaon. To keep on learning more, follow DexLab Analytics blog.

ARMA(p,q) model in time series forecasting is a combination of Autoregressive Process also known as AR Process and Moving Average (MA) Process where p corresponds to the autoregressive part and q corresponds to the moving average part.

Autoregressive Process (AR) :- When the value of Y_{t} in a time series data is regressed over its own past value then it is called an autoregressive process where p is the order of lag into consideration.

Where,

Y_{t} = observation which we need to find out.

α_{1}= parameter of an autoregressive model

Y_{t-1}= observation in the previous period

u_{t}= error term

The equation above follows the first order of autoregressive process or AR(1) and the value of p is 1. Hence the value of Y_{t} in the period ‘t’ depends upon its previous year value and a random term.

Moving Average (MA) Process :- When the value of Y_{t} of order q in a time series data depends on the weighted sum of current and the q recent errors i.e. a linear combination of error terms then it is called a moving average process which can be written as :-

y_{t} = observation which we need to find out

α= constant term

β_{ut-q}= error over the period q .

ARMA (Autoregressive Moving Average) Process :-

The above equation shows that value of Y in time period ‘t’ can be derived by taking into consideration the order of lag p which in the above case is 1 i.e. previous year’s observation and the weighted average of the error term over a period of time q which in case of the above equation is 1.

How to decide the value of p and q?

Two of the most important methods to obtain the best possible values of p and q are ACF and PACF plots.

ACF (Auto-correlation function) :- This function calculates the auto-correlation of the complete data on the basis of lagged values which when plotted helps us choose the value of q that is to be considered to find the value of Y_{t}. In simple words how many years residual can help us predict the value of Y_{t} can obtained with the help of ACF, if the value of correlation is above a certain point then that amount of lagged values can be used to predict Y_{t}.

Using the stock price of tesla between the years 2012 and 2017 we can use the .acf() method in python to obtain the value of p.

.DataReader() method is used to extract the data from web.

The above graph shows that beyond the lag 350 the correlation moved towards 0 and then negative.

PACF (Partial auto-correlation function) :- Pacf helps find the direct effect of the past lag by removing the residual effect of the lags in between. Pacf helps in obtaining the value of AR where as acf helps in obtaining the value of MA i.e. q. Both the methods together can be use find the optimum value of p and q in a time series data set.

Lets check out how to apply pacf in python.

As you can see in the above graph after the second lag the line moved within the confidence band therefore the value of p will be 2.

So, with that we come to the end of the discussion on the ARMA Model. Hopefully it helped you understand the topic, for more information you can also watch the video tutorial attached down this blog. The blog is designed and prepared by Niharika Rai, Analytics Consultant, DexLab AnalyticsDexLab Analytics offers machine learning courses in Gurgaon. To keep on learning more, follow DexLab Analytics blog.

The theory of estimation is a branch in statistics that provides numerical values of the unknown parameters of the population on the basis of the measured empirical data that has a random component. This is a process of guessing the underlying properties of the population by observing the sample that has been taken from the population. The idea behind this is to calculate and find out the approximate values of the population parameter on the basis of a sample statistics.

Population:- All the items in any field of inquiry constitutes to a “Population”. For example all the employees of a factory is a population of that factory and the population mean is represented and the size of the population is represented by N.

Sample:- Selection of few items from the population constitutes to a sample and the mean of the sample is represented by and the sample size is represented by n

Statistics:- Any statistical measure calculated on the basis of sample observations is called Statistic. Like sample mean, sample standard deviation, etc.

Estimator:- In general estimator acts as a rule, a measure computed on the basis of the sample which tells us how to calculate the values of the estimate. It is a functional form of all sample observations prorating a representative value of the collected sample.

Suppose we have a random sample x_1,x_2,…,x_n on a variable x, whose distribution in the population involves an unknown parameter. It is required to find an estimate of on the basis of sample values.

Unbiasedness:-A statistic t is said to be an unbiased estimator if E(β ̂)= βi.e. observed value is equal to the expected value. In case E(β ̂)≠ β then the estimator is biased estimator.

Consistency:- One of the most desirable property of good estimator is that its accuracy should increase when the sample becomes larger i.e. the error between the expected value and the observed value reduces as the size of the sample increases E(β ̂ )- β=0

Efficiency:-An estimator is said to be an efficient estimator if it has the smallest variance compared to all the consistent and unbiased estimators. If consistent estimator exists whose sampling variance is less than that of any other consistent estimator, it is said to be “most efficient”; and it provides a standard for the measurement of ‘efficiency’ of a statistic.

Sufficiency:- An estimator is said to be sufficient if it contains all information in the sample about .

At the end of this discussion, hopefully, you have learned what theory of estimation is. Watch the video tutorial attached below to learn more about this. DexLab Analytics is a data science training institute in gurgaon, that offers advanced courses. Follow the blog section to access more informative posts like this.

NumPy also known as numerical python, is a library consisting of multidimensional array objects and a collection of routines for processing those arrays. Using NumPy, mathematical and logical operations on arrays can be performed without it which was not possible. For example-

Multiplication of two lists will cause an error as a data structure like lists, tuple, dictionaries and sets do not allow mathematical operations.

Therefore we need NumPy to covert our data structures like lists into 1d, 2d, 3d or nd arrays so that mathematical operations can be performed. U

We can use .array() methods to create these arrays.

Now let’s check out few examples and also perform few mathematical operations to have a better understanding.

In the above code we first import NumPy library and then use .array() method to two 1d-array a1 and b1 using the list we previously created.

Now let’s multiply a1 and b1 array.

Now let’s use .array() method to directly create an array.

Arrays can be created using lists, tuples and dictionaries as you can see in the above example.

Now for 2-d arrays recall that we can also make list of lists. Let’s use that to create 2d-arrays.

2d-arrays can also be created using tuples.

Remember that we are not using these as matrices because matrix multiplication is an entirely different thing we are just trying to perform mathematical operations which were otherwise not possible.

Random Module

Numpy also has various ways with which we can create array of random numbers which then can be used in number of ways like generating a data for practice purposes or for building beautiful graphs for a presentation.

Given below is a list of type of random numbers you can generate

.rand() :- This particular method helps you generate uniformly distributed random numbers i.e. numbers between 0 and 1 where each number between 0 and 1 will have equal probability to be in the sample dataset.

The above code generates a 2d-array with values between 0 and 1.

.randn():- This method generates normally distributed random numbers i.e. numbers between -3 and +3 where mean=median=mode and ploted gives a bell shaped curve.

Here the 20 random numbers are generated ranging between -3 and + 3.

Note:- Remember that the data is randomly picked from the normally distributed values between -3 and +3 so the graph is not bell shaped but the original data from which the values are being picked randomly is bell shaped with mean=median-mode.

.randint():-This method generates random integers between a given range.

So, with that we come to the end of the discussion on the Numpy. Hopefully it helped you understand Numpy, for more information you can also watch the video tutorial attached down this blog. DexLab Analytics offers machine learning courses in delhi. To keep on learning more, follow DexLab Analytics blog.

In our previous blog we studied about the basic concepts of Linear Regression and its assumptions and let’s practically try to understand how it works.

Given below is a dataset for which we will try to generate a linear function i.e.

y=b_{0}+b_{1}Xi

Where,

y= Dependent variable

Xi= Independent variable

b_{0} = Intercept (coefficient)

b_{1 }= Slope (coefficient)

To find out beta (b_{0}& b_{1}) coefficients we use the following formula:-

Let’s start the calculation stepwise.

First let’s find the mean of x and y and then find out the difference between the mean values and the X_{i} and Y_{i}e. (x-x ̅ ) and (y-y ̅ ).

Now calculate the value of (x-x ̅ )^{2} and (y-y ̅ )^{2}. The variation is squared to remove the negative signs otherwise the summation of the column will be 0.

Next we need to see how income and consumption simultaneously variate i.e. (x-x ̅ )* (y-y ̅ )

Now all there is left is to use the above calculated values in the formula:-

As we have the value of beta coefficients we will be able to find the y ̂(dependent variable) value.

We need to now find the difference between the predicted y ̂ and observed y which is also called the error term or the error.

To remove the negative sign lets square the residual.

What is R^{2 }and adjusted R^{2} ?

R^{2} also known as goodness of fit is the ratio of the difference between observed y and predicted and the observed y and the mean value of y.

Hopefully, now you have understood how to solve a Linear Regression problem and would apply what you have learned in this blog. You can also follow the video tutorial attached down the blog. You can expect more such informative posts if you keep on following the DexLab Analytics blog. DexLab Analytics provides data Science certification courses in gurgaon.

Today’s blog explores another vital statistical concept Linear Regression, let’s begin. Linear regression is normally used in statistics for predictive modeling. It tries to model a relationship between two independent (explanatory variable) and dependent (explained variable) variables X and Y by fitting a linear equation (Y=b_{o}+b_{1}X+U_{i}) to an observed data.

Assumptions of linear regression

U_{i} is a random real variable, where U_{i }is the difference between the observed dependent variable Y and predicted Y variable.

The mean of U_{i }in any particular period is zero.

The variance of U_{i} is constant in each period i.e for all values of X, U_{i} will show the same dispersion around their mean

The variable U_{i} has a normal distribution i.e the value of U_{i} (for each X_{i}) have a bell shaped symmetrical distribution about their zero mean.

The random terms of different observations are independent i.e the covariance of any U_{i }with any other U_{j} is equal to zero.

U_{i} is independent of the explanatory variable X.

X_{i }are a set of fixed values in the hypothesised process of repeated sampling which underlies the linear regression model.

In case there are more than one explanatory variables then they are not perfectly linearly correlated.

Linear Regression equation can be written as:

Where,

is the dependent variable

X is the independent variable.

b_{0 }is the intercept (where the line crosses the vertical y-axis)

b_{1 }is the slope

U_{i} is the error term (difference between ) also called residual or white noise.

Simple linear regression follows the properties of Ordinary Least Square (OLS) which are as follows:-

Unbiased estimator:- E()=b ie. an estimator is unbiased if its bias is 0; E() – b = 0

Minimum Variance:- An estimate is best when it has the smallest variance as compared to any other estimate obtained from other econometric method.

Efficient estimator:- When it has both the previous properties ie.

Linear estimator

Best, Linear, Unbiased estimator (BLUE)

Minimum mean squared error (MSE) estimator:- It is a combination of the unbiasedness and minimum variance properties. An estimator is a minimum MSE estimator if it has the smallest mean square error.

With that the discussion on Linear Regression wraps up here, hopefully it cleared away any confusion you might have and helped you get a grasp on the concept. We have a video discussion on this same topic, which is attached below this blog, check it out for further reference.