SAS Predictive Modelling Archives - Page 2 of 4 - DexLab Analytics | Big Data Hadoop SAS R Analytics Predictive Modeling & Excel VBA

## The ABC of Summary Statistics and T Tests in SAS

Students who are new to statistics in SAS training often use summary statistics (such as sample size, mean, and standard deviation) to test hypotheses and to compute confidence intervals. In this blog, we will show you how to supply summary statistics (instead of raw data) to PROC TTEST in SAS, how to create a data set that contains summary statistics, and how to run PROC TTEST to compute a two-sample or one-sample t test for the mean.

So, let’s start!

#### Running a two-sample t test for difference of means from summarized statistics

We will start by comparing the mean heights of 19 students, grouped by gender. The data are in the Sashelp.Class data set.

The following SAS statements sort the data by the grouping variable, call PROC MEANS to compute the statistics, and print a subset of them:

```
proc sort data=sashelp.class out=class;
   by sex;                               /* sort by group variable */
run;

proc means data=class noprint;           /* compute summary statistics by group */
   by sex;                               /* group variable */
   var height;                           /* analysis variable */
   output out=SummaryStats;              /* write statistics to data set */
run;

proc print data=SummaryStats label noobs;
   where _STAT_ in ("N", "MEAN", "STD");
   var Sex _STAT_ Height;
run;
```

The printed table shows the structure of the SummaryStats data set for a two-sample test. The two samples are identified by the levels of the Sex variable (‘F’ for females and ‘M’ for males). The _STAT_ column holds the name of each statistic, and the Height column holds the value of that statistic for each group.


The problem: The heights of sixth-grade students are normally distributed. Random samples of n1=9 females and n2=10 males are selected. The mean height of the female sample is m1=60.5889 with a standard deviation of s1=5.0183. The mean height of the male sample is m2=63.9100 with a standard deviation of s2=4.9379. Is there evidence that the mean height of sixth-grade students depends on gender?

You do not need to do anything special to make PROC TTEST analyze summary statistics. When the procedure finds the special variable _STAT_ and its special values (such as N, MEAN and STD), it understands that the data set contains summarized statistics rather than raw data. The following statements compare the mean heights of males and females:

```
proc ttest data=SummaryStats order=data
           alpha=0.05 test=diff sides=2; /* two-sided test of diff between group means */
   class sex;
   var height;
run;
```

Notice that the output includes 95% confidence intervals for the group means, as well as confidence intervals for the standard deviations.

In the second table, the ‘Pooled’ row assumes that the variances of the two groups are approximately equal, which appears reasonable for these data (the sample standard deviations are 5.0183 and 4.9379). The value of the t statistic is t = -1.45 with a two-sided p-value of 0.1645. Because the p-value exceeds 0.05, the data do not provide evidence at that level that mean height depends on gender.
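As a check on the reported value, the pooled t statistic can be worked out by hand from the summary statistics in the problem statement. This is a sketch using the standard pooled-variance formula, with intermediate values rounded:

```latex
s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}
      = \frac{8\,(5.0183)^2 + 9\,(4.9379)^2}{17} \approx 24.76,
\qquad s_p \approx 4.976

t = \frac{m_1 - m_2}{s_p\sqrt{\tfrac{1}{n_1}+\tfrac{1}{n_2}}}
  = \frac{60.5889 - 63.9100}{4.976 \times \sqrt{\tfrac{1}{9}+\tfrac{1}{10}}}
  \approx \frac{-3.3211}{2.286} \approx -1.45,
\qquad \text{df} = n_1 + n_2 - 2 = 17
```

The result agrees with the t = -1.45 shown in the ‘Pooled’ row.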

The syntax of the PROC TTEST statement lets you change the type of hypothesis test and the significance level. For example, you can run a one-sided test for the alternative hypothesis μ1 < μ2 at the 0.10 significance level by using:

`proc ttest ... alpha=0.10 test=diff sides=L; /* Left-tailed test */`

#### Running a one-sample t test of the mean from summarized statistics

In the section above, you learnt how to create the summary statistics with PROC MEANS. However, you can also create the summary statistics manually if you do not have the original data.
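For instance, the two-sample statistics from the height problem above could be typed in directly instead of being computed by PROC MEANS. The following is a minimal sketch, not the original author's code; the data set name SummaryStats2 is a hypothetical choice, and the layout simply mirrors the Sex, _STAT_ and Height columns printed earlier:

```sas
/* Hypothetical data set name; values are taken from the problem statement. */
data SummaryStats2;
   infile datalines dsd truncover;
   input Sex :$1. _STAT_ :$8. Height;
datalines;
F,N,9
F,MEAN,60.5889
F,STD,5.0183
M,N,10
M,MEAN,63.9100
M,STD,4.9379
;

/* Same two-sided test of the difference between group means as before. */
proc ttest data=SummaryStats2 order=data alpha=0.05 test=diff sides=2;
   class Sex;
   var Height;
run;
```

Because the statistics are rounded to four decimals, the output may differ slightly in the last digits from the results computed from the raw data.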

The problem: A research study measured the pulse rates of 57 college men and found a mean pulse rate of 70.4211 beats per minute with a standard deviation of 9.9480 beats per minute. Researchers want to know if the mean pulse rate for all college men is different from the current standard of 72 beats per minute.

The following statements write the summary statistics to a data set and ask PROC TTEST to perform a one-sample test of the null hypothesis μ = 72 against a two-sided alternative hypothesis:

```
data SummaryStats;
   infile datalines dsd truncover;
   input _STAT_:$8. X;
datalines;
N, 57
MEAN, 70.4211
STD, 9.9480
;

proc ttest data=SummaryStats alpha=0.05 H0=72 sides=2; /* H0: mu=72 vs two-sided alternative */
   var X;
run;
```

The output includes a 95% confidence interval for the mean, and that interval contains the value 72. The value of the t statistic is t = -1.20, which corresponds to a p-value of 0.2359. Therefore, the data fail to reject the null hypothesis at the 0.05 significance level.
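The one-sample statistic can be reproduced by hand from the three summary values. This is a sketch with rounded intermediate values; the critical value 2.003 is an approximate two-sided 0.05 quantile of the t distribution with 56 degrees of freedom:

```latex
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
  = \frac{70.4211 - 72}{9.9480/\sqrt{57}}
  \approx \frac{-1.5789}{1.3176} \approx -1.20,
\qquad \text{df} = n - 1 = 56

\bar{x} \pm t_{0.975,\,56}\,\frac{s}{\sqrt{n}}
  \approx 70.4211 \pm 2.003 \times 1.3176
  \approx (67.78,\; 73.06)
```

The interval contains 72, which is consistent with failing to reject the null hypothesis.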


This post originally appeared on blogs.sas.com/content/iml/2017/07/03/summary-statistics-t-tests-sas.html

## Predictive Analytics: In conversation with Adam Bataran, Managing Director of GTM Global Salesforce Platforms at Bluewolf

To discuss predictive analytics, we have with us Adam Bataran, Managing Director of GTM Global Salesforce Platforms at Bluewolf.

Read Mr. Bataran’s answers to understand the concept better.

The question: What does predictive analytics mean, and what value does it bring to businesses today?

The answer: Predictive analytics uses data, machine learning techniques and statistical algorithms to predict future business outcomes and trends from past data. It draws on a number of distinct but advanced analytics disciplines and technologies, from data mining and statistical analysis to predictive modeling and machine learning, to answer the most sought-after questions: “What will happen next?” and “How will customers react to this?”

## New and Improved Data Pane in SAS Visual Analytics Now Goes Painless

Here is some good news: preparing your data for effective reports is easier with the 8.1 release of SAS Visual Analytics. In this technical blog, we will look at the structure of the Data pane, how it displays data from the active data source, and a handful of tasks you might want to perform, such as viewing measure details, adjusting data item properties and creating geographic data items, custom categories and hierarchies.

## How Predictive Analysis Could Have Saved the World from Ransomware

Kudos to you if you stayed offline for the last couple of days and spent the weekend with your family and loved ones. The world has been reeling from the news of the WannaCry ransomware attack this weekend, and the situation got worse on Monday, after offices opened. According to figures released on Monday evening by Elliptic, a Bitcoin forensics firm that is monitoring the situation, $57,282.23 in ransom has been paid to the hackers behind the attack, which took over hundreds of thousands of computers worldwide on Friday and through the weekend.

## Trends to Watch Out – Global Self-service Business Intelligence (BI) Market 2017

According to Gartner, the global BI and analytics market is expected to grow to USD 22.8 billion by 2020.

The Global Self-Service Business Intelligence (BI) Market Research Report 2017 provides a comprehensive, detailed analysis of the Self-Service BI industry, including present market trends and norms. It focuses mainly on major regions such as North America, Europe and Asia, and on countries such as Germany, the US, China and Japan.

## How to Determine the Size of a SAS Data Set

When program code, applications and SAS data sets are developed, efficiency often does not get enough attention, especially during the initial phases of development. Since data size and system behavior can influence how a program or application performs, SAS users need access to information about a data set’s size and content. To find out how much disk space a data set is using, users can query the metadata that SAS maintains and do a few simple calculations. The following tip shows how to retrieve and interpret that information, which helps with SAS performance tuning.

#### Implementing PROC SQL and DICTIONARY.TABLES

SAS collects valuable information (also known as metadata) about all known SAS libraries, data sets (tables), indexes, views, catalogs, macros, system options and more in a collection of read-only tables called Dictionary tables and their SASHELP view counterparts. TABLES, a particular Dictionary table, and its SASHELP view equivalent, VTABLE, contain details about a SAS session’s data sets. The following PROC SQL code accesses four columns of the TABLES Dictionary table, namely LIBNAME, MEMNAME, MEMTYPE and FILESIZE, to display the size of the CARS data set.

#### PROC SQL and Dictionary.TABLES:

```
PROC SQL ;
  TITLE 'Filesize for CARS Data Set' ;
  SELECT LIBNAME,
         MEMNAME,
         FILESIZE FORMAT=SIZEKMG.,   /* size with an automatic KB/MB/GB suffix */
         FILESIZE FORMAT=SIZEK.      /* size in kilobytes */
    FROM DICTIONARY.TABLES
    WHERE LIBNAME = 'SASHELP'
      AND MEMNAME = 'CARS'
      AND MEMTYPE = 'DATA' ;
QUIT ;
```

#### Analysis

The above results show that the CARS data set filesize is 192KB.

Nota bene: When the SIZEKMG. format is specified in a FORMAT= option, SAS determines whether to report the value in KB (kilobytes), MB (megabytes) or GB (gigabytes), and divides the filesize value by the corresponding one of the following values:

| Unit | Divisor (bytes) |
|------|-----------------|
| KB   | 1024            |
| MB   | 1048576         |
| GB   | 1073741824      |
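To make the arithmetic concrete, the following minimal sketch applies these divisors to a file size of 196608 bytes, the FILESIZE value this post reports for the CARS data set. The variable names bytes, kb and mb are hypothetical and chosen only for illustration:

```sas
/* Sketch: convert a byte count to KB and MB using the divisors listed above. */
data _null_;
   bytes = 196608;          /* FILESIZE reported for SASHELP.CARS in this post */
   kb = bytes / 1024;       /* kilobytes: 196608 / 1024 = 192 */
   mb = bytes / 1048576;    /* megabytes: 196608 / 1048576 = 0.1875 */
   put bytes= kb= mb=;      /* write the three values to the SAS log */
   put bytes sizekmg.;      /* the same byte count written with the SIZEKMG. format */
run;
```

The 192 KB result matches the file size reported in the Analysis section above.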

#### Using PROC PRINT and SASHELP.VTABLE

In the following example, PROC PRINT accesses three columns of the VTABLE SASHELP view, namely LIBNAME, MEMNAME and FILESIZE, to display the size of the CARS data set.

#### PROC PRINT and SASHELP.VTABLE

```
PROC PRINT DATA=SASHELP.VTABLE NOOBS ;
  VAR LIBNAME MEMNAME FILESIZE ;
  WHERE LIBNAME = 'SASHELP'
    AND MEMNAME = 'CARS' ;
  FORMAT FILESIZE SIZEKMG. ;
  TITLE 'Filesize for SASHELP.CARS Data Set' ;
RUN ;
```

#### Using DATA _NULL_, SASHELP.VEXTFL and CALL SYMPUTX

Lastly, a DATA _NULL_ step reads the VEXTFL SASHELP view for a file reference that is assigned with a FILENAME statement. An assignment statement converts the FILESIZE value (196608 bytes for the CARS data set) to megabytes. The CALL SYMPUTX routine then left-justifies the numeric value and removes trailing blanks as it stores the result in the FILESIZE macro variable.

#### DATA _NULL_ and SASHELP.VEXTFL

```
filename myfile 'C:\Program Files\SAS9.4\SASFoundation\9.4\CORE\SASHELP\Cars.sas7bdat' ;
DATA _NULL_ ;
  SET SASHELP.VEXTFL (WHERE=(FILEREF='MYFILE')) ;
  /* Calculate the Filesize in MB */
  FILESIZE = FILESIZE / (1024 ** 2) ;
  /* Store the value in the macro variable FILESIZE */
  CALL SYMPUTX ('FILESIZE',FILESIZE) ;
RUN ;
```
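Once the step has run, the macro variable can be inspected in the SAS log. The line below is a small usage sketch, and the message text is only illustrative:

```sas
/* Write the macro variable created by CALL SYMPUTX to the SAS log. */
%put Filesize of the CARS data set in MB: &FILESIZE ;
```

For a 196608-byte file the stored value is 196608 / 1024**2 = 0.1875. If no row in SASHELP.VEXTFL matches the MYFILE file reference, CALL SYMPUTX never executes and the macro variable is not created.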

#### Results


This post originally appeared on blogs.sas.com/content/sastraining/2017/04/25/determining-the-size-of-a-sas-data-set


As per the latest research from Strategy Analytics, global smartwatch shipments grew by 1 percent annually to hit a record 8.2 million units in the fourth quarter of 2016. Apple Watch drove the growth and dominated with a 63 percent share of the global smartwatch market, while Samsung continues to hold second position.

Neil Mawston, Executive Director at Strategy Analytics, said that global shipments grew by 1 percent annually, from 8.1 million units in Q4 2015 to 8.2 million in Q4 2016. The fourth quarter marked a return to growth for the smartwatch industry after two consecutive quarters of declining volumes. Smartwatch growth is recovering slightly because of new product launches from other industry giants, seasonal demand for these gadgets, and stronger demand driven by Apple in major developed markets such as the US and UK. For the full year, international smartwatch shipments also grew by 1 percent, from 20.8 million in 2015 to a record high of 21.1 million in 2016.

## How to Simulate Multiple Samples From a Linear Regression Model

In this blog post, we will learn how to simulate multiple samples efficiently. To keep the discussion simple, we previously simulated a single sample with n observations and p variables. However, to use the Monte Carlo method to approximate the sampling distribution of a statistic, you need to simulate many samples from the same regression model.

The SAS DATA step that simulates a single sample is typically presented in four steps. To simulate multiple samples, put a DO loop around the steps that generate the error term and the response variable for each observation in the model.
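The idea of wrapping a DO loop around the error and response steps can be sketched as follows. This is an illustrative sketch rather than the original post's code: the model y = 1 - 2*x + error, the error standard deviation, the sample sizes, the seed, and the names RegSim, SampleID and PE are all assumptions.

```sas
/* Hypothetical sketch: simulate many samples from y = 1 - 2*x + error. */
%let N = 50;               /* observations per sample (assumed) */
%let NumSamples = 100;     /* number of simulated samples (assumed) */

data RegSim;
   call streaminit(12345);                 /* fix the seed for reproducibility */
   do i = 1 to &N;
      x   = rand("Uniform");               /* explanatory variable, generated once per observation */
      eta = 1 - 2*x;                       /* linear predictor with assumed coefficients */
      do SampleID = 1 to &NumSamples;      /* loop around the error and response steps only */
         y = eta + rand("Normal", 0, 0.5); /* response = linear predictor + random error */
         output;
      end;
   end;
run;

proc sort data=RegSim;                     /* order rows by sample for BY-group processing */
   by SampleID i;
run;

proc reg data=RegSim outest=PE noprint;    /* fit the same model to every sample */
   by SampleID;
   model y = x;
run; quit;
```

Each sample shares the same x values and differs only in its random errors. The PE data set then holds one row of parameter estimates per sample, which can be summarized to approximate the sampling distribution of the estimates.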
