
How to Append Data to Add Markers to SAS Graphs

Would you like to create customized SAS graphs with PROC SGPLOT and other ODS Graphics procedures? Then an essential skill to learn is how to join, merge, concatenate, and append SAS data sets that come from a variety of sources. The SG procedures (the SAS statistical graphics procedures) enable users to overlay many kinds of customized curves, bars, and markers. But the SG procedures expect all the data for a graph to be in a single data set, so it often becomes necessary to append two or more data sets before you can create a complex graph.

In this blog post, we discuss two ways to combine data sets in order to create ODS graphics. An alternative is the SG annotation facility, which adds extra curves and markers to a graph. We recommend the techniques in this article for simple features, and reserve annotation for adding highly complex, non-standard features.

Using Overlay Curves:

Here is a brief idea of how to structure a SAS data set so that you can overlay curves on a scatter plot.

The original data is contained in the X and Y variables, as shown in the figure below; these are the coordinates for the scatter plot. The secondary information is appended at the end of the data: the variables X1 and Y1 contain the coordinates of one custom scatter-plot smoother, and X2 and Y2 contain the coordinates of another.

[Figure: structure of the appended data set (sgplotoverlay). Source: blogs.sas.com]

This structure enables you to use the SGPLOT procedure to overlay two curves on the scatter plot: a SCATTER statement plus two SERIES statements build the graph, as sketched below.
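A minimal sketch of what those statements might look like (the data set name All follows the figure above; it and the legend labels are assumptions rather than code from the original post):

proc sgplot data=All;
scatter x=x y=y / legendlabel="Data";
series x=x1 y=y1 / legendlabel="Smoother 1";
series x=x2 y=y2 / legendlabel="Smoother 2";
run;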

With the right Retail Analytics Courses, you can learn to do the same, and much more with SAS.

Using Overlay Markers: Wide form

Sometimes, in addition to overlaying curves, we want to add special markers to the scatter plot. In this post we show how to add a marker at the location of the sample mean: we use PROC MEANS to create an output data set that contains the coordinates of the sample mean, and then append that data set to the original data.

The statements below use PROC MEANS to compute the sample means of four variables in the Sashelp.Iris data set, which contains measurements of 150 iris flowers. To emphasize the generality of the syntax, we use macro variables, but note that this is not necessary:

%let DSName = Sashelp.Iris;
%let VarNames = PetalLength PetalWidth SepalLength SepalWidth;
  
proc means data=&DSName noprint;
var &VarNames;
output out=Means(drop=_TYPE_ _FREQ_) mean= / autoname;
run;

The AUTONAME option on the OUTPUT statement tells PROC MEANS to append the name of the statistic to the variable names. As a result, the output data set contains variables with names like PetalLength_Mean and SepalWidth_Mean.

As depicted in the earlier figure, this enables you to append the new data to the end of the old data in the “wide form”, as shown here:

data Wide;
   set &DSName Means; /* add four new variables; pad with missing values */
run;
 
ods graphics / attrpriority=color subpixel;
proc sgplot data=Wide;
scatter x=SepalWidth y=PetalLength / legendlabel="Data";
ellipse x=SepalWidth y=PetalLength / type=mean;
scatter x=SepalWidth_Mean y=PetalLength_Mean / 
         legendlabel="Sample Mean" markerattrs=(symbol=X color=firebrick);
run;

The result looks like this:

[Figure: scatter plot with a marker at the sample mean. Source: blogs.sas.com]

The first SCATTER statement and the ELLIPSE statement use the original data. (Recall that the ELLIPSE statement draws an approximate confidence ellipse for the population mean.) The second SCATTER statement uses the sample means, which were appended to the end of the original data; it draws a red marker at the location of the sample mean.

This method can be used to plot other sample statistics (such as the median) or to highlight special values such as the origin of a coordinate system.
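As a hedged sketch of the median variant (reusing the macro variables defined above; the data set names Medians and Wide2 are assumptions):

proc means data=&DSName noprint;
var &VarNames;
output out=Medians(drop=_TYPE_ _FREQ_) median= / autoname;
run;

data Wide2;
   set &DSName Medians; /* append the medians; pad with missing values */
run;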

Using Overlay Markers: Long form

In certain circumstances, it is better to append the secondary data in the “long form”, in which the secondary data set contains variables with the same names as those in the original data set. You can use the SAS DATA step to create a variable that identifies the original and the supplementary observations. This technique is useful when you want to show multiple markers (sample mean, median, mode, and so on) by using the GROUP= option on a single SCATTER statement.

For detailed explanation of these steps and more on such techniques, join our SAS training courses in Delhi.

The following call to PROC MEANS does not use the AUTONAME option, so the output data set contains variables with the same names as the input data. We use the IN= data set option to create an ID variable that identifies which observations come from the computed statistics:

/* Long form. New data has same name but different group ID */
proc means data=&DSName noprint;
var &VarNames;
output out=Means(drop=_TYPE_ _FREQ_) mean=;
run;
 
data Long;
set &DSName Means(in=newdata);
if newdata then 
   GroupID = "Mean";
else GroupID = "Data";
run;

The DATA step creates the GroupID variable, which has the value “Data” for the original observations and the value “Mean” for the appended observations. This data structure is useful for calling PROC SGSCATTER, which supports the GROUP= option but does not support multiple PLOT statements, as follows:

ods graphics / attrpriority=none;
proc sgscatter data=Long 
   datacontrastcolors=(steelblue firebrick)
   datasymbols=(Circle X);
plot (PetalLength PetalWidth)*(SepalLength SepalWidth) / group=groupID;
run;

[Figure: scatter plot matrix with markers for sample means. Source: blogs.sas.com]

In closing, this post demonstrates some useful techniques for adding markers to a graph. The techniques require concatenating the original data with supplementary data. When creating ODS statistical graphics in SAS, appending and merging data is often the best approach, and it is a great technique to include in your programming toolbox.

SAS courses in Noida can give you further details on some more techniques that are worth adding to your analytics toolbox!

 
This post originally appeared on blogs.sas.com/content/iml/2016/11/30/append-data-add-markers-sas-graphs.html
 

Interested in a career as a Data Analyst?

To learn more about Data Analyst with Advanced Excel course – Enrol Now.
To learn more about Data Analyst with R Course – Enrol Now.
To learn more about Big Data Course – Enrol Now.

To learn more about Machine Learning Using Python and Spark – Enrol Now.
To learn more about Data Analyst with SAS Course – Enrol Now.
To learn more about Data Analyst with Apache Spark Course – Enrol Now.
To learn more about Data Analyst with Market Risk Analytics and Modelling Course – Enrol Now.

What Does The Market Look Like for Hadoop in 2018 – 2022?

It would be an understatement to say that Hadoop took the Big Data market by storm in the years 2012 to 2016. That period in the history of data witnessed a wave of mergers, acquisitions, and high-valuation financing rounds. And it is no exaggeration to state that today Hadoop is the only cost-sensible and scalable open-source alternative to the commercially available Big Data management tools and packages.

It has not only emerged as the de facto industry standard for business intelligence (BI), but has also become an integral part of almost all commercially available Big Data solutions.

By 2015, it had become quite clear that Hadoop had failed to deliver in terms of revenue. From 2012 to 2015, the growth and development of Hadoop systems was financed mostly by venture capital, supplemented by acquisition money and R&D project budgets.

There is no doubt that Hadoop talent is sparse and does not come cheap; Hadoop has a steep learning curve that most cannot manage to climb. Yet more and more enterprises find themselves drawn into the gravitational pull of this massive open-source system, mostly because of the functionality it offers. Several interesting trends have emerged in the Hadoop market within the last two years:

  • The transformation from batch processing to online processing
  • The emergence of MapReduce alternatives such as Spark, DataTorrent and Storm
  • Increasing dissatisfaction with the gap between what SQL-on-Hadoop promises and what it currently delivers
  • A further spur to Hadoop adoption from the emergence of IoT
  • In-house development and deployment of Hadoop
  • Niche enterprises focussing on enhancing Hadoop's features and functionality (visualization, governance, ease of use) and easing its way to market

While it still has a few obvious setbacks, there is no doubt that Hadoop is here to stay for the long haul, and rapid growth can be expected in the near future.

[Figure: Hadoop, the Next Big Thing in India. Image Source: aws.amazon.com]

As per market forecasts, the Hadoop market is expected to grow at a CAGR (compound annual growth rate) of 58%, surpassing USD 16 billion by 2020.

The major players in the Hadoop industry include Teradata Corporation, Rainstor, Cloudera, Inc., Hortonworks Inc., Fujitsu Ltd., Hitachi Data Systems, Datameer, Inc., Cisco Systems, Inc., Hewlett-Packard, Zettaset, Inc., IBM, Dell, Inc., Amazon Web Services, Datastax, Inc., and MapR Technologies, Inc.

Several opportunities are emerging for the Hadoop market in a changing global environment, where Big Data is affecting IT businesses in the following two ways:

  1. The need to accommodate exponentially increasing amounts of data (storage, analysis, processing)
  2. Increasingly cost-prohibitive pricing models imposed by established IT vendors

[Figure. Image Source: tdwi.org]

The forecast for the Hadoop market for the years 2017 to 2022 can be summarised as follows:

  1. By geography: EMEA, the Americas, and Asia/Pacific
  2. By software and hardware services: commercially supported Hadoop software, Hadoop appliances and hardware, Hadoop services (integration, consulting, middleware, and support), outsourcing, and training
  3. By verticals
  4. By tiers of data (the quantity of data managed by organizations)
  5. By application: advanced/predictive analytics, ETL/data integration, data mining/visualization, social media and clickstream analysis, data warehouse offloading, IoT (Internet of Things) and mobile devices, active archives, and cyber-security log analysis

[Figure: chain-link graph. Image Source: tdwi.org]

This chain-link graph shows that each component of an industry is closely linked to data analytics and management, and each plays an equally important role in generating business opportunities and better revenue streams.

Enjoy a 10% discount as DexLab Analytics launches #BigDataIngestion!

Contact us through our social media channels, or mail us, to know more about availing this offer.

THIS OFFER IS FOR COLLEGE STUDENTS ONLY!

 

Interested in a career as a Data Analyst?

To learn more about Machine Learning Using Python and Spark – click here.
To learn more about Data Analyst with Advanced Excel course – click here.
To learn more about Data Analyst with SAS Course – click here.
To learn more about Data Analyst with R Course – click here.
To learn more about Big Data Course – click here.

Understanding Credit Risk Management With Modelling and Validation

The term credit risk encompasses all types of risk associated with different financial instruments: default risk (for example, a debtor has not met his or her legal obligations according to the debt contract), migration risk (which arises from adverse internal or external rating movements), and country risk (the debtor cannot pay as required because of measures or events decided by the political or monetary agencies of the country itself).

In compliance with the Basel regulations, most banks choose to develop their own credit risk measurement parameters: Probability of Default (PD), Loss Given Default (LGD), and Exposure at Default (EAD). Several MNCs have gathered solid experience by developing models for the Internal Ratings-Based Approach (IRBA) for different clients.
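For reference, these three parameters combine into the standard expected-loss relation, a textbook formula quoted here for context:

Expected Loss (EL) = PD × LGD × EAD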

For implementation of these Credit Risk Assessment parameters, we need the following data analytics and visualization tools:

  • SAS Credit Risk Modelling for Banking
  • SAS Enterprise Miner and SAS Credit Scoring
  • Matlab

[Figure: default probability curve for each counterparty. Image Source: businessdecision.be]

Validating credit and counterparty risk:

The models built to compute risk must be revalidated on a regular basis.

On one hand, the second pillar of the Basel regulations implies that supervisors should check that banks' risk models work consistently and deliver optimal results. On the other hand, recent crises have drawn the attention of bank stakeholders (business, CRO) to a much keener interest in the models.

The process of validation consists of a review of the development process and all the related aspects of model implementation. The process can be divided into two parts:

  1. Quality control is mainly concerned with the ongoing monitoring of the model in use: the quality of the input variables, judgemental decisions, and the resulting model output.
  2. Quantitative backtesting statistically compares the periodic risk parameters with their actual outcomes.

In the context of credit risk, the process of validation is concerned with three main parameters: probability of default (PD), exposure at default (EAD), and loss given default (LGD). For each of these three, a complete backtest is performed at three levels:

  1. Discriminatory power: the ability of the model to differentiate between defaults and non-defaults, or between high losses and low losses.
  2. Predictive power: a statistical comparison of the predicted risk parameters with the outcomes actually observed.
  3. Stability: how much the portfolio has changed between the time the model was first developed and now.

In the resulting three-by-three matrix (parameter × level), each component has one or more standardized tests to run. With the right Credit Risk Modelling training, an individual can implement all of the above tests and produce the required reporting.

In the counterparty credit risk context, one must also consider the uncertainty of the exposure and the bilateral nature of the risk. Hence, exposure at default is replaced by the EPE (expected positive exposure) or the EEPE (effective expected positive exposure).

The tests include comparing the observed P&L with the EEPE, making sure that violations are moderate and that the pass rate meets a predetermined level, for instance 70%.
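A minimal, hypothetical sketch of such a check in R (all names, the simulated data, and the 70% level are assumptions for illustration only):

# Count how often the observed exposure exceeds the model's EEPE,
# then check the pass rate against a predetermined level.
set.seed(1)
observed <- rlnorm(250, meanlog = 0, sdlog = 0.5)  # hypothetical daily observations
eepe <- 1.6                                        # hypothetical model EEPE
violations <- sum(observed > eepe)
pass_rate <- 1 - violations / length(observed)
pass_rate >= 0.70                                  # TRUE if the 70% pass level is met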


For better visualization, here is an example:

[Figure. Image Source: businessdecision.be]

Risk models:

The National Bank of Belgium (NBB), the Belgian regulator, insists that appropriate conservative measures be incorporated to compensate for the discrepancies of value and risk models. For example, the NBB requires an assessment of model risk based on an inventory of:

  1. The risks the model covers, along with an assessment of the quality of the results calculated by the model (maturity of the model, adequacy of the assumptions made, weaknesses and limitations of the model, etc.) and the improvements planned over time.
  2. The risks not yet covered by the model, along with an assessment of the materiality of these risks and of the process for handling them.
  3. The elements covered by a general modelling method, the entities covered by a more simplified method, and those not covered at all.

A quality Credit Risk Management Course can provide you with the necessary functional and technical knowledge to assess the model risk.

 

Interested in a career as a Data Analyst?

To learn more about Data Analyst with Advanced Excel course – Enrol Now.
To learn more about Data Analyst with R Course – Enrol Now.
To learn more about Big Data Course – Enrol Now.

To learn more about Machine Learning Using Python and Spark – Enrol Now.
To learn more about Data Analyst with SAS Course – Enrol Now.
To learn more about Data Analyst with Apache Spark Course – Enrol Now.
To learn more about Data Analyst with Market Risk Analytics and Modelling Course – Enrol Now.

Black Money in India Can be Traced With Ease by Applying Big Data Analytics

The economy took a hit with the recent demonetization of the INR 500 and 1000 currency notes. Economists around the world are still debating whether the move was good or bad, but it has definitely caused huge inconvenience for the public. Moreover, exchanging such a large amount of old currency notes is nothing short of a Herculean task, as almost 85 percent of the economy is in the form of high-denomination currency.


 

These measures have been taken by the government to curb the flow of black money in India and to uproot corruption. While the reaction of the common people to the move remains mixed, technology experts have a different viewpoint on preventing the flow of black money in the country: they say that with modern technologies like Big Data analytics, it will be possible to trace black money with relative ease.

Continue reading “Black Money in India Can be Traced With Ease by Applying Big Data Analytics”

How to Use Data Analysis for SEO and PPC

Using custom functions in Excel VBA:

When you work in the SEO and PPC industry, it is a given that you will be handling large amounts of data. There are several ways to utilize and manage this data with Excel's built-in functions, and several online tutorials cover them. But what if Excel does not have a function for what you need to do with the data? You can use the Visual Basic for Applications (VBA) feature in MS Excel to write your own functions and make Excel do what you want.

So, in this advanced Excel training blog post, we will discuss how to write a simple custom Excel function, and we will also give you some general advice on getting started with Excel VBA.

Getting started with the Excel VBA editor:

First, in order to work with the VBA editor in Excel, open a new Excel workbook and press ALT + F11. This opens a new window: the VBE (Visual Basic Editor). This is where you can write your own functions to use with the spreadsheet you have open in your Excel document. The Project Explorer pane, in the top-left corner of the window, has an icon for each sheet of the document and another for the workbook itself. Next, right-click 'ThisWorkbook', then choose 'Insert' and then 'Module'. This adds a code module, the container for the code we will write here.

Now you are ready to write your first Excel function:


Data analysis will help you analyze the keywords:

Data gathered through SEO and PPC will often involve keywords and phrases, which can produce a large amount of data to work with. For a recent piece of analysis, one of our faculty members was asked to find a method for counting the number of words in a search term, so that single keywords could be treated differently from phrases; for example, 'dresses' can be treated quite differently from 'red party dresses'. But there are often hundreds or even thousands of keywords to work with, and manually counting the words in each phrase would be far too time-consuming. There is also no built-in MS Excel function that does this, so we must use VBA to write one ourselves.

Adding the code:

Function countWords(phrase As String) As Integer

This is the first line of the function you are about to write; start by copying it into the module we just created, underneath the words 'Option Explicit', which should already be present (if not, do not worry: just copy the line at the top, and we will come back to this later). This one line has a lot of important things to tell us.

Function: this first word tells us what kind of code follows. A function is simply a piece of code that takes one or more values, does something with them, and returns another value. For instance, Excel has a built-in function called SUM, which takes some input values, adds them together, and returns their total. Similarly, the function we create will take a keyword or phrase as input, count the number of words in it, and return that number.

countWords: this is the name we have given our function. Whenever we want to use it, we simply type 'countWords' into a spreadsheet cell, just as we would type 'SUM' to use the sum function.

phrase As String: this is the input, the keyword or phrase whose words we want to count, entered as a text string.

As Integer: this is the type of value that the function returns. We are only interested in the number of whole words in a phrase, so we aim to return an integer value.

How to prepare the function:

The next step is to prepare the function by declaring its variables. Here we declare the variables in 'countWords' as integers, because the function is built to work only with integers. This lets Excel warn us if anything unexpected happens: for example, if we count the words in 'red party dresses' and the result comes back as the text 'party' rather than a number, something has clearly gone wrong, and because the variable is declared as an integer, Excel will raise an error instead of continuing silently.

The variables in this function are called 'i' and 'counter'. There is no hard and fast rule that says you must name your variables this way; name them as you like. But 'i' is commonly used as an abbreviation for index, and 'counter' is used here simply as a counter. The next step is to add this line to your code:

'Counts number of occurrences of the space character in a phrase
Dim i As Integer
Dim counter As Integer: counter = 1

Note that 'Dim' is short for dimension; it declares the data type of a variable. Through this code we have told Excel that the variable 'counter' will always be an integer, and we have given it an initial value of 1, while 'i' has no value assigned to it yet. The first line should appear green in the code window because of the apostrophe that precedes it: it is merely a comment and does nothing within the code, existing only as a label that tells us what the code is for. It is good practice to comment your code, since it otherwise often becomes very hard to understand. Feel free to add your own comments throughout; all you have to do is put an apostrophe before them.

How can you count the words?

You must understand that Excel has no preconceived notion of what a word is, so to count words, the concept has to be broken down into a few short steps that Excel can follow. One of the key features of a word is that it has a space before it, after it, or both.

So we can start by simply telling Excel to count these spaces:

For i = 1 To Len(phrase)
    If Mid(phrase, i, 1) = " " Then
        counter = counter + 1
    End If
Next i

This is the heart of the function; paste it or type it out in the code module, line by line if you prefer. Here is an explanation of what happens at each step:

For i = 1 To Len(phrase)

Here we have given 'i' a value, in fact a whole range of values, from 1 to Len(phrase). Len is a built-in function that returns the number of characters (letters and spaces) in the phrase we pass it.
 
If Mid(phrase, i, 1) = " " Then

This line of code uses the 'Mid' function, which asks Excel to look at each character of the phrase in turn. Mid takes three inputs: the phrase to look at, the character position at which to begin the comparison, and the number of characters to compare. We want to compare one character at a time, so we pass in 'i' and 1. Then comes the 'If' statement, which says that if the character is a space, Excel should proceed to the next line of code; otherwise it skips ahead to the 'End If' statement.

counter = counter + 1

This line only runs when a space is found, so we increase our counter variable by 1 each time, counting the number of spaces in the phrase.

End If
Next i

These two lines let Excel know where the If statement ends, and tell it to go back to the top and start again with the next value of 'i'. This is called a 'For loop', because we are telling Excel to repeat the task for a certain number of iterations.

There is one last piece of code, which handles a particular situation: when the phrase passed in is blank. Copy the following beneath what you already have:

If phrase = "" Then
    countWords = 0
Else
    countWords = counter
End If
 

This is another If statement: if the input phrase is blank, countWords takes the value 0; otherwise it takes the value of the 'counter' variable. Setting 'counter' to 1 initially ensures the code works for single words, which contain no spaces, but it would also make the function return 1 for blank phrases; this check prevents that error.

End Function

Finally, with this we tell Excel that we have finished defining our function. The full code is shown below; check whether yours looks the same:

[Image of the complete function. Source: us.searchlaboratory.com]
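Assembled from the snippets above, the complete function reads as follows:

Function countWords(phrase As String) As Integer
    'Counts number of occurrences of the space character in a phrase
    Dim i As Integer
    Dim counter As Integer: counter = 1

    For i = 1 To Len(phrase)
        If Mid(phrase, i, 1) = " " Then
            counter = counter + 1
        End If
    Next i

    If phrase = "" Then
        countWords = 0
    Else
        countWords = counter
    End If
End Function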

After you are done, close the VBE by clicking the 'X' in the corner and return to the spreadsheet. Type some words into a few cells, then type '=countWords(' in another cell, click one of the cells containing text, and close the parenthesis. That cell should now contain the number of words in the cell you pointed it at. If it doesn't, check that calculation is set to automatic (Formulas > Calculation Options > Automatic in Excel 2010).
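For example, if cell A1 contains the text 'red party dresses' (a hypothetical cell reference), entering the following formula in another cell returns 3:

=countWords(A1)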

This simple function is a great time-saver, since it can be dragged down over as many cells as you like, covering hundreds of keywords and phrases. However, keep its limitations in mind: we are counting spaces, not words as such, so a phrase containing an extra space, for instance, would be over-counted.

 

Interested in a career as a Data Analyst?

To learn more about Data Analyst with Advanced Excel course – Enrol Now.
To learn more about Data Analyst with R Course – Enrol Now.
To learn more about Big Data Course – Enrol Now.

To learn more about Machine Learning Using Python and Spark – Enrol Now.
To learn more about Data Analyst with SAS Course – Enrol Now.
To learn more about Data Analyst with Apache Spark Course – Enrol Now.
To learn more about Data Analyst with Market Risk Analytics and Modelling Course – Enrol Now.

Making a Histogram With Basic R Programming Skills

Over the course of the next few weeks, DexLab Analytics will cover the basics of various data analysis techniques, such as creating your own histogram in R. We will explore three options for this: basic R commands, ggplot2, and ggvis. These posts are for beginner and intermediate users of R who want accessible, easy-to-understand resources.

 


 

Seeking more information? Then take up our R language training course in Gurgaon from DexLab Analytics.

 

What is a histogram?

A histogram is a type of visual representation of a dataset's distribution; as such, the shape of a histogram is its most recognisable feature. With a histogram, you can see which segments of the data hold relatively more observations and which hold the least.

 

Put in simpler terms, you can see where the middle (the median) of the data distribution lies, how closely the data sit around that middle, and where possible outliers are to be found. Precisely because of all this, histograms are a great way to understand your data.

 

But what can a specific shape of a histogram tell us? In short, a typical histogram consists of an x-axis, a y-axis, and bars of varying heights. The y-axis shows how frequently the values on the x-axis occur in the data, while the bars group ranges of values or continuous categories on the x-axis; the latter explains why histograms have no gaps between the bars.

 


How can one make a histogram with basic R?

Step 1: Get your eyes on the data:

Since a histogram needs some data to plot, start by importing a dataset or simply using one built into R. In this tutorial we use two datasets: the built-in R dataset AirPassengers, and another dataset called chol, which is stored in a .txt file and is available for download.

Step 2: Acquaint yourself with the hist() function:

The easy way to make a histogram in R is with the hist() function, which automatically computes a histogram of the given data values. To use it, put the name of your dataset between the parentheses.

Here is how to use the function:

hist(AirPassengers)

 

But what if you want to select a certain column of a data frame, for instance in chol, for making a histogram? Then use the hist() function with the dataset name in combination with a $ symbol, followed by the column name:

 


Here is an example:

hist(chol$AGE) #computes a histogram of the data values in the column AGE of the data frame named "chol"

Step 3: Up the level of the hist() function:

You may find that the histograms created so far look a little dull; the default visualization does not contribute much to understanding, and one more step leads to a better and easier reading of the histogram. Fortunately, this is not difficult to accomplish: R offers several fast, easy ways to optimize the visualization while still using the hist() function.

To adapt your histogram, you only need to add more arguments to the hist() function, like this:

hist(AirPassengers,
     main="Histogram for Air Passengers",
     xlab="Passengers",
     border="blue",
     col="green",
     xlim=c(100,700),
     las=1,
     breaks=5)

This code computes a histogram of the data values in the dataset AirPassengers, with "Histogram for Air Passengers" as the title. The x-axis is labelled 'Passengers', the bins get blue borders with a green fill, the x-axis is limited to the range 100 to 700, the printed values on the y-axis are rotated to horizontal (las=1), and the binning is guided by a suggestion of 5 breakpoints.

We know what you are thinking: this is a humongous string of code. But do not worry; let us break it down into smaller pieces to see what each component does.

Name/colours:

You can alter the title of the histogram by adding main as an argument to the hist() function.

This is how:

hist(AirPassengers, main="Histogram for Air Passengers") #Histogram of the AirPassengers dataset with title "Histogram for Air Passengers"

To adjust the label of the x-axis, add xlab as an argument. Similarly, you can use ylab to label the y-axis.

This code would work:

hist(AirPassengers, xlab="Passengers", ylab="Frequency of Passengers") #Histogram of the AirPassengers dataset with changed labels on the x- and y-axes

If you want to change the colours of the default histogram, simply add the border or col arguments. Adjusting them is easy, as the names give away what they do: border sets the border colour of the bins, and col sets their fill colour.

hist(AirPassengers, border="blue", col="green") #Histogram of the AirPassengers dataset with blue-bordered bins filled with green

Note: do not forget to put the title and the colour names within quotation marks (" ").

For x and y axes:

To change the range of the x- and y-axes, use xlim and ylim as arguments to the hist() function:

The code to be used is:

hist(AirPassengers, xlim=c(100,700), ylim=c(0,30)) #Histogram of the AirPassengers dataset with the x-axis limited to values 100 to 700 and the y-axis limited to values 0 to 30

A point to be noted here: the c() function is used to delimit the values on the axes when using xlim and ylim. It takes two values: the first is the start value and the second is the end value.

You can rotate the labels on the y-axis by adding las=1 as an argument; the las argument can be 0, 1, 2, or 3.

The code to be used:

hist(AirPassengers, las=1) #Histogram of the AirPassengers dataset with the y-values projected horizontally

 

Depending on the option you choose, the placement of the labels varies: 0 keeps the labels parallel to the axis (the default), 1 puts them horizontally, 2 puts them perpendicular to the axis, and 3 places them vertically.

For bins:

You can alter the binning by including breaks as an argument, set to the number of breakpoints you want.

This is the code to be used:

hist(AirPassengers, breaks=5) #Histogram of the AirPassengers dataset with 5 breakpoints

If you want more control over the breakpoints between the bins, you can enrich the breaks argument with a vector of breakpoints, again by making use of the c() function:

hist(AirPassengers, breaks=c(100, 300, 500, 700)) #Compute a histogram for the data values in AirPassengers, and set the bins such that they run from 100 to 300, 300 to 500 and 500 to 700.

But the c() function can make your code very messy at times, which is why we recommend using breaks=seq(x, y, z) instead. You determine the values of x, y and z, which appear in a specific order: the starting number of the x-axis, the last number, and the interval at which the numbers should appear.
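For instance, the following call (bin edges chosen purely for illustration) produces bin edges from 100 to 700 in steps of 150, equivalent to breaks=c(100, 250, 400, 550, 700):

hist(AirPassengers, breaks=seq(100, 700, 150)) #Histogram of the AirPassengers dataset with bin edges generated by seq()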

A noteworthy point: you can also combine both functions:

hist(AirPassengers, breaks=c(100, seq(200,700, 150))) #Make a histogram for the AirPassengers dataset, start at 100 on the x-axis, and from values 200 to 700, make the bins 150 wide

Here is the histogram of AirPassengers:

[Figure: How to Make a Histogram with Basic R. Image courtesy r-bloggers]

Please note that this is the first instalment in a series of three posts on creating histograms with R programming.

For more information regarding R language training, and for other interesting news and articles, follow our regular uploads on all our channels.

 
This post originally appeared on www.r-bloggers.com/how-to-make-a-histogram-with-basic-r
 

Interested in a career as a Data Analyst?

To learn more about Machine Learning Using Python and Spark – click here.
To learn more about Data Analyst with Advanced Excel course – click here.
To learn more about Data Analyst with SAS Course – click here.
To learn more about Data Analyst with R Course – click here.
To learn more about Big Data Course – click here.

We Are Providing Corporate Trainings To Jones Lang Lasalle Pvt. Ltd.

We are happy to announce that we have recently begun our corporate training sessions for the multinational company Jones Lang LaSalle Pvt. Ltd. on Tableau Business Intelligence.

 
 

JLL is a professional services and investment management firm specializing in real estate services for clients who seek increased value by owning, investing in, and occupying real estate. It is a Fortune 500 company with annual revenue of USD 5.2 billion and gross revenue of USD 6.0 billion. It has more than 280 corporate offices, operates in more than 80 countries, and has a global workforce of 60,000. The firm provides management and real estate outsourcing services for its clients' realty portfolios, which comprise 4.0 billion square feet (372 million square meters), and in 2015 it completed USD 138 billion in sales, financial transactions, and acquisitions. Continue reading “We Are Providing Corporate Trainings To Jones Lang Lasalle Pvt. Ltd.”

In Justice Numbers Speak Louder Than Words!

In Indian prisons, two-thirds of the prisoners are under-trials!

 

 

This Monday, the “Prison Statistics India 2015” report was released by the National Crime Records Bureau (NCRB). Here are five surprising things we gathered from the data about the condition of prisons in India. Continue reading “In Justice Numbers Speak Louder Than Words!”

Why Getting a Big Data Certification Will Benefit Your Small Business

Do you know how much data is produced globally every day?

 

As per reports published by IBM, the figure is 2.5 quintillion bytes. Written out, that is 2,500,000,000,000,000,000 bytes. And we thought our mobile devices with 64 GB of memory were capable of storing huge amounts of data!

 


Increasing reliance on Big Data

As technology expands at a speed next to light, more companies are planning to invest in Big Data platforms to get the best out of them. Gartner Inc. recently conducted research among 437 global organisations across different industries and found that more than 75% of them are looking forward to the benefits they can derive from Big Data. The purpose of using Big Data varied somewhat across these organisations, but most of the companies were found to use data analytics to enhance their customer service segments. Lately, security breaches have hit the headlines more often than global warming, and that has been a cause of worry for many data-driven companies; thus, they are opting for Big Data tools to strengthen their online security. Continue reading “Why Getting a Big Data Certification Will Benefit Your Small Business”

Call us to know more