
## Rudiments of Hierarchical Clustering: Ward’s Method and Divisive Clustering

Clustering, a process used for organizing objects into groups called clusters, has wide-ranging applications in day-to-day life, including fields like marketing, city planning and scientific research.

Hierarchical clustering, one of the most common methods of clustering, builds a hierarchy of clusters either by a “bottom-up” approach (agglomerative clustering) or by a “top-down” approach (divisive clustering). In the previous blogs, we discussed the various distance measures and how to perform agglomerative clustering using linkage types. Today, we will explain Ward’s method and then move on to divisive clustering.

#### Ward’s method:

This is a special type of agglomerative hierarchical clustering technique that was introduced by Ward in 1963. Unlike the linkage methods, Ward’s method does not define a distance between clusters directly. Instead, it treats clustering as an analysis of variance problem and generates clusters that have minimum within-cluster variance. The method is based on the error sum of squares (ESS), defined for the jth cluster as the sum of the squared Euclidean distances from its points to the cluster mean:

$$ESS_j = \sum_{i=1}^{n_j} \lVert X_{ij} - \bar{X}_j \rVert^2$$

where $X_{ij}$ is the ith observation in the jth cluster, $\bar{X}_j$ is the mean of the jth cluster and $n_j$ is the number of observations in it. The error sum of squares for all clusters is the sum of the $ESS_j$ values from all clusters, that is,

$$ESS = \sum_{j=1}^{k} ESS_j$$

where k is the number of clusters.

The algorithm starts with each observation forming its own one-element cluster for a total of n clusters, where n is the number of observations. The mean of each of these one-element clusters is equal to that one observation, so the ESS is initially zero. In the first stage of the algorithm, two elements are merged into one cluster in a way that ESS (error sum of squares) increases by the smallest amount possible. This amounts to merging the two nearest observations in the dataset.

Up to this point, the Ward algorithm gives the same result as any of the three linkage methods discussed in the previous blog. From the second stage onward, however, the choice is driven by variance: at every stage, the pair of clusters that is merged is the one whose union produces the smallest increase in ESS.

This keeps the observations as close as possible to the centers of their clusters at each stage. The process is carried on until all the observations are in a single cluster.
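The merge rule described above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than a production implementation: the function names and the five sample observations are made up for this example, and every candidate merge is scored by recomputing ESS directly from its definition.

```python
from itertools import combinations

def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def ess(points):
    """Error sum of squares: squared Euclidean distances to the cluster mean."""
    m = mean(points)
    return sum(sum((x - c) ** 2 for x, c in zip(p, m)) for p in points)

def ess_increase(a, b):
    """How much the total ESS grows if clusters a and b are merged."""
    return ess(a + b) - ess(a) - ess(b)

def ward(points):
    """Naive Ward clustering; returns merges as (cluster_a, cluster_b, ess_increase)."""
    clusters = [[p] for p in points]  # every observation starts alone
    merges = []
    while len(clusters) > 1:
        # Pick the pair whose union raises the total ESS the least.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ess_increase(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], ess_increase(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical 2-D observations, made up for illustration.
data = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0), (9.0, 1.0)]
merges = ward(data)
for a, b, inc in merges:
    print(a, "+", b, "-> ESS increase", round(inc, 3))
```

Because ESS starts at zero and ends at the ESS of the single all-inclusive cluster, the increases recorded along the way add up to that final value. For real datasets, a library routine such as SciPy’s `scipy.cluster.hierarchy.linkage` with `method="ward"` is a more practical choice than this quadratic-per-step sketch.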

#### Divisive clustering:

Divisive clustering is a “top-down” approach in hierarchical clustering where all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. Let’s consider an example to understand the procedure.

Consider the distance matrix given below. First of all, the Minimum Spanning Tree (MST) needs to be calculated for this matrix.

The MST Graph obtained is shown below.

The subsequent steps for performing divisive clustering are given below:

Cut edges from MST graph from largest to smallest repeatedly.

Step 1: All the items are in one cluster- {A, B, C, D, E}

Step 2: The largest edge is between D and E, so we cut it to obtain two clusters- {E}, {A, B, C, D}

Step 3: Next, we remove the edge between B and C, which results in- {E}, {A, B}, {C, D}

Step 4: Finally, we remove the edges between A and B (and between C and D), which results in- {E}, {A}, {B}, {C} and {D}
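The procedure above can be sketched in Python as follows. The distance matrix from the original example is not reproduced here, so the sketch assumes five hypothetical points on a line (A=0, B=1, C=3, D=4, E=7), chosen only because their minimum spanning tree has the same shape as the one in the steps: D–E is the longest edge, B–C the next, and A–B and C–D are tied as the shortest. The function names are illustrative.

```python
from itertools import combinations

def minimum_spanning_tree(labels, dist):
    """Kruskal's algorithm on a complete graph; returns (weight, a, b) edges."""
    parent = {x: x for x in labels}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    mst = []
    for w, a, b in sorted((dist[a][b], a, b) for a, b in combinations(labels, 2)):
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge joins two separate trees, so keep it
            parent[ra] = rb
            mst.append((w, a, b))
    return mst

def components(labels, edges):
    """Connected components of the graph formed by the remaining edges."""
    neighbours = {x: set() for x in labels}
    for _, a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in labels:
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.append(node)
            stack.extend(neighbours[node] - seen)
        groups.append(sorted(group))
    return sorted(groups)

def divisive_levels(labels, dist):
    """Cut MST edges from largest to smallest; tied edges are cut together."""
    edges = minimum_spanning_tree(labels, dist)
    levels = [components(labels, edges)]
    while edges:
        largest = max(w for w, _, _ in edges)
        edges = [e for e in edges if e[0] < largest]
        levels.append(components(labels, edges))
    return levels

# Hypothetical positions on a line, not the article's original matrix.
pos = {"A": 0, "B": 1, "C": 3, "D": 4, "E": 7}
labels = sorted(pos)
dist = {a: {b: abs(pos[a] - pos[b]) for b in labels} for a in labels}
levels = divisive_levels(labels, dist)
for step, clusters in enumerate(levels, start=1):
    print("Step", step, clusters)
```

Cutting tied edges in the same pass mirrors Step 4, where the A–B and C–D edges are removed together.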

Hierarchical clustering is easy to implement and outputs a hierarchy, which is structured and informative. One can easily figure out the number of clusters by looking at the dendrogram.

However, there are some disadvantages of hierarchical clustering. For example, it is not possible to undo a previous step or move observations around once they have been assigned to a cluster. It is a time-consuming process, hence not suitable for large datasets. Moreover, this method of clustering is very sensitive to outliers, and the ordering of the data affects the final results.

In the following blog, we shall explain how to implement hierarchical clustering in R programming with examples. So, stay tuned and follow DexLab Analytics – a premium Big Data Hadoop training institute in Gurgaon. To aid your big data dreams, we are offering flat 10% discount on our big data Hadoop courses. Enroll now!

#### Check back for our previous blogs on clustering:

Hierarchical Clustering: Foundational Concepts and Example of Agglomerative Clustering

## How Hadoop Data Lake Architecture Helps Data Integration

New data objects, like data planes, data streaming and data fabrics, are gaining importance these days. However, let’s not forget the shiny data object from a few years back: the Hadoop data lake architecture. Real-life Hadoop data lakes are the foundation around which many companies aim to develop better predictive analytics and sometimes even artificial intelligence.

This was the crux of the discussions that took place at Hortonworks’ DataWorks Summit 2018. Here, bigwigs like Sudhir Menon shared how established companies want to use their data stores to enable digital transformation, as tech startups like Airbnb and Uber have done.

In the story shared by Menon, vice president of enterprise information management at hotelier Hilton Worldwide, Hadoop data lake architecture plays a key role. He said that all the information available in different formats across different channels is being integrated into the data lake.

The Hadoop data lake architecture forms the core of a would-be consumer application that enables Hilton Honors program guests to check into their rooms directly.

#### A time-taking procedure:

Menon stated that the Hadoop data lake project, which began around two years back, is progressing rapidly and will start functioning soon. However, it is a “multiyear project”. The project aims to build out the Hadoop-based Hortonworks Data Platform (HDP) into a fresh warehouse for enterprise data in a step-by-step manner.

The system makes use of a number of advanced tools, including WSO2 API management, Talend integration and Amazon Redshift cloud data warehouse software. It also employs microservices architecture to transform an assortment of ingested data into JSON events. These transformations are the primary steps in the process of refining the data. The experience of data lake users shows that the data needs to be organized immediately so that business analysts can work with the data on BI tools.

This project also provides a platform for smarter daily data reporting. Hilton has replaced 380 dashboards with 40 compact dashboards.

For companies like Hilton that have years of legacy data, shifting to Hadoop data lake architectures can take a good deal of effort.

Another data lake project is in progress at United Airlines, a Chicago-based airline. Its senior manager for big data analytics, Joe Olson, spoke about the move to adopt a fresh big data analytics environment that incorporates a data lake and a “curated layer of data.” He also pointed out that the process of handling large data needs to be more efficient. A lot of work is required to connect the Teradata data analytics warehouse with Hortonworks’ platform.

Differences in file sizes between Hadoop data lakes and single-client implementations may lead to problems related to garbage collection and can hamper performance.

Despite these implementation problems, the Hadoop platform has fueled various advances in analytics. Verizon Wireless is one example: its analytics have evolved to handle bigger and more diverse data sets.

In fact, companies now want the data lake platforms to encompass more than Hadoop. The future systems will be ‘’hybrids of on-premises and public cloud systems and, eventually, will be on multiple clouds,’’ said Doug Henschen, an analyst at Constellation Research.

Large companies are very much dependent on Hadoop for efficiently managing their data. Understandably, the job prospects in this field are also multiplying.

Are you a big data aspirant? Then you must enroll for big data Hadoop training in Gurgaon. At Dexlab industry-experts guide you through theoretical as well as practical knowledge on the subject. To help your endeavors, we have started a new admission drive #BigDataIngestion. All students get a flat discount on big data Hadoop certification courses. To know more, visit our website.

## Industry Use Cases of Big Data Hadoop Using Python – Explained

Welcome to the BIG world of Big Data Hadoop, the encompassing ecosystem of open-source projects and procedures that together construct a formidable framework to manage data. Put simply, Hadoop is the bedrock of big data operations. Though the entire framework is written in Java, that doesn’t exclude other programming languages, such as Python and C++, from being used to code against this intricate distributed storage and processing framework. Besides Java architects, Python-skilled data scientists can also work on the Hadoop framework, write programs and perform analysis. Programs can be written in Python without the need to translate them into Java JAR files.

Python as a programming language is simple, easy to understand and flexible. It is capable and powerful enough to run end-to-end advanced analytical applications. Not to mention, Python is a versatile language and here we present a few popular Python frameworks in sync with Hadoop:

• Dumbo
• Mrjob
• Pydoop
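To give a feel for what “writing Hadoop programs in Python” looks like, here is a minimal word-count sketch in the style of Hadoop Streaming, where the mapper and reducer exchange tab-separated text records. It is an illustration under assumptions, not code from any of the frameworks listed above: the function names and sample input are made up, and the map, sort and reduce phases are simulated locally in a single process.

```python
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts for each word; input must already be sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> shuffle/sort -> reduce on made-up input.
sample = ["big data hadoop", "python hadoop", "big data"]
result = list(reducer(sorted(mapper(sample))))
for record in result:
    print(record)
```

On a real cluster, the mapper and reducer would typically live in two separate scripts that read from standard input and print to standard output, and Hadoop would take care of sorting the mapper output by key before it reaches the reducer.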

Now, let’s take a look at how some of the top-notch global companies are using Hadoop in association with Python and reaping the benefits!

#### Amazon

Based on the consumer research and buying pattern, Amazon recommends suitable products to the existing users. This is done by a robust machine learning engine powered by Python, which seamlessly interacts with Hadoop ecosystem, aiding in delivering top of the line product recommendation system and boosting fault tolerant database interactions.

#### Facebook

In the domain of image processing, Facebook is second to none. Each day, Facebook processes millions and millions of images based on unstructured data. For that, Facebook had to enable HDFS, which helps store and extract enormous volumes of data, while using Python as the backend language for a large chunk of its image processing applications, including facial image extraction, image resizing, etc.

Rightfully so, Facebook relies on Python for all its image-related applications and uses the Hadoop Streaming API for better accessibility and editing of data.

#### Quora Search Algorithm

Quora’s backend is constructed on Python; hence it’s the language used for interaction with HDFS. Quora also needs to manage vast amounts of textual data, which it does with the help of Hadoop, Apache Spark and a few other data-warehousing technologies. Quora uses the power of Hadoop coupled with Python to pull up questions for searches or suggestions.

#### End Notes

The use of Python is varied; being dynamically typed, portable, extendable and scalable, Python has become a popular choice for big data analysts specializing in Hadoop. Mentioned below are a few other notable companies where use cases of Hadoop using Python are to be found:

• YouTube uses a recommendation engine built using Python and Apache Spark.
• Limeroad functions on an integrated Hadoop, Apache Spark and Python recommendation system to retain online visitors through a proper, well-devised search pattern.
• Iconic animation companies, like Disney depend on Python and Hadoop; they help manage frameworks for image processing and CGI rendering.

Now, you need to start thinking about arming yourself with big data hadoop certification course – these big data courses are quite in demand now – as it’s expected that the big data and business analytics market will increase from \$130.1 billion to more than \$203 billion by 2020.

## Why Portability is Gaining Momentum in the Field of Data

Ease and portability are of prime importance to businesses. Companies want to handle data in real-time; so there’s need for quick and smooth access to data. Accessibility is often the deciding factor that determines if a business will be ahead or behind in competition.

Data portability is a concept that is aimed at protecting users by making data available in a structured, machine-readable and interoperable format. It enables users to move their data from one controller to another. Organizations are required to follow common technical standards to assist the transfer of data instead of storing data in “walled gardens” that render the data incompatible with other platforms.

Now, let’s look a little closer into why portability is so important.

Making data portable gives consumers the power to access data across multiple channels and platforms. It improves data transparency as individuals can look up and analyze relevant data from different companies. It will also help people to exercise their data rights and find out what information organizations are holding. Individuals will be able to make better queries.

From keeping track of travel distance to monitoring energy consumption on the move, portable data can connect with various activities and lends itself well to analysis. Portable data may be used by businesses to map consumers better and help them make better decisions, all the while collecting data very transparently. Thus, it improves data personalization.

For example, the portable data relating to a consumer’s past grocery purchases can be utilized by a grocery store to provide useful sales offers and recipes. Portable data can help doctors quickly find information about a patient’s medical history (blood group, diet, regular activities, habits, etc.), which will benefit the treatment. Hence, data portability can enhance our lifestyle in many ways.

#### Struggles:

Portable data presents a plethora of benefits for users in terms of data transparency and consumer satisfaction. However, it has its own set of limitations too. The downside of greater transparency is security issues. It permits third parties to regularly access password-protected sites and request login details from users. Scary as it may sound, people who use the same password for multiple sites are easy targets for hackers and identity thieves, who can easily access the entire digital activity of such users.

Although GDPR stipulates that data should be in a common format, that alone doesn’t secure standardization across all platforms. For example, one business may name a field “Location” while another business might call the same field “Locale”. In such cases, if the data needs to be aligned with other data sources, it has to be done manually.
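The alignment problem described above can be illustrated with a small Python sketch. Everything here is hypothetical: the alias table, the shared field name and the two sample exports are made up to show the idea. Someone still has to decide by hand that “Location” and “Locale” mean the same thing; the code only applies that decision consistently.

```python
# Hypothetical alias table: maps each source's field name to a shared name.
FIELD_ALIASES = {"Location": "location", "Locale": "location"}

def normalize(record, aliases=FIELD_ALIASES):
    """Rename known fields to the shared name; leave unknown fields untouched."""
    return {aliases.get(key, key): value for key, value in record.items()}

# Made-up exports from two different businesses.
export_a = {"Name": "Asha", "Location": "Gurgaon"}
export_b = {"Name": "Asha", "Locale": "Gurgaon"}

print(normalize(export_a))
print(normalize(export_b))
```

After normalization both records expose the same `location` key, so they can be compared or merged without further special-casing.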

According to GDPR rules, if an organization receives a request pertaining to data portability, then it has to respond within one month. While organizations might readily give out data to general consumers, they might hold back the same information if they perceive the request as coming from a competitor.

#### Future:

Data portability runs the risk of placing unequal power in the hands of big companies that have the financial resources to automate data requests, set up an entire department to cater to portability requests and pay GDPR fines if needed.

Despite these issues, there are many positives. It can help track a patient’s medical statistics and provide valuable insights about the treatment; and encourage people to donate data for good causes, like research.

As businesses as well as consumers weigh the pros and cons of data portability, one thing is clear- it will be an important topic of discussion in the years to come.

Businesses consider data to be their most important asset. As the accumulation, access and analysis of data is gaining importance, the prospects for data professionals are also increasing. You must seize these lucrative career opportunities by enrolling for Big Data Hadoop certification courses in Gurgaon. We at Dexlab Analytics bring together years of industry experience, hands-on training and a comprehensive course structure to help you become industry-ready.

Don’t miss the summer special course discounts on big data Hadoop training in Delhi. We are offering flat 10% discount to all interested students. Hurry!

## For a Seamless, Real-Time Integration and Access across Multiple Data Siloes, Big Data Fabric Is the Solution

Grappling with diverse data?

No worries, data fabrics for big data are right here.

The very notion of a fabric joining computing resources and offering centralized access to a set of networks has been doing the rounds since the conceptualization of grid computing as early as the 1990s. A data fabric, however, is a relatively new concept based on the same underlying principle, but it’s associated with data instead of a system.

As data has become increasingly diversified, the importance of data fabrics has grown sharply too. Integrating such vast pools of data is quite a problem, as data collected across various channels and operations is often held in discrete silos. The responsibility lies with the enterprise to bring together transactional data stores, data lakes, warehouses, unstructured data sources, social media storage, machine logs, application storage and cloud storage for management and control.

#### The Change That Big Data Brings In

The escalating use of unstructured data has resulted in significant issues with proper data management. While the accuracy and usability requirements have remained more or less the same, the ability to control the data has been reduced because of its increasing velocity, variety, volume and access requirements. To counter this pressing challenge, companies have come up with a number of solutions, but the need for a centralized data access system prevails. On top of that, big data adds concerns regarding data discovery and security that can only be addressed through a single access mechanism.

To taste success with big data, enterprises need access to data from a plethora of systems in real time and in perfectly digestible formats. Connected devices, including smartphones and tablets, add to the storage-related issues. Today, big data storage is abundantly available in Apache Spark, Hadoop and NoSQL databases, each of which comes with its own management demands.

#### The Popularity of Data Fabrics

Large data and analytics vendors are the biggest providers of big data fabric solutions. They offer access to all kinds of data and join them into a single consolidated system. This consolidated system, the big data fabric, should tackle diverse data stores, address security issues, offer consistent management through unified APIs and software access, provide auditability and flexibility, be upgradeable and handle data ingestion, curation and integration smoothly.

With the rise of machine learning and artificial intelligence, the requirements placed on data stores increase, as they form the foundation of model training and operations. Therefore, enterprises are always seeking a single platform and a single point for data access, which reduces the intricacies of the system and ensures easy storage of data. Not only that, data scientists no longer need to focus on the complexities of data access; rather, they can give their entire attention to problem-solving and decision-making.

To better understand how data fabrics provide a single platform and a single point for data access across myriad siloed systems, you need a top of the line big data certification today. Visit DexLab Analytics for recognized and well-curated big data hadoop courses in Gurgaon.

## Secrets behind the Success of AI Implanted Analytics Processes

Big data combined with machine learning results in a powerful tool. Businesses are using this combination more and more, with many believing that the age of AI has already begun. Machine learning embedded in analytics processes opens new gateways to success, but companies must be careful about how they use this power. Organizations use this powerful platform in various domains, such as fraud detection, boosting cybersecurity and carrying out personalized marketing campaigns.

Machine learning isn’t a technology that simply speeds up the process of solving existing problems. It holds the potential to provide solutions that weren’t even thought of before, boost innovation and identify problem areas that went unnoticed. To utilize this potent tech in the best possible way, companies need to be aware of AI’s strengths as well as its limitations. Businesses need to adopt renewed ways of harnessing the power of AI and analytics. Here are the top 4 ways to make the most out of AI and big data.

#### Context is the key:

Sifting through available information, machine learning can provide insights that are compelling and trustworthy. But, it lacks the ability to judge which results are valuable. For example, taking up a query from a garment store owner, it will provide suggestions based on previous sales and demographic information. However, the store owner might see that some of these suggestions are redundant or impractical. Moreover, humans need to program AI so that it takes into account variables and selects relevant data sets to analyze. Hence, context is the key. Business owners need to present the proper context, based on which AI will provide recommendations.

Machine learning can offer a perfect answer to your query. But, it can do much more. It might stun you by providing appropriate solutions to queries you didn’t even ask. For example, if you are trying to convince a customer to take a particular loan, then machine learning can crunch huge data sets and provide a solution. But is drawing more loans your real goal? Or is the bigger goal increasing revenues? If this is your actual goal, then AI might provide amazing solutions, like opening a new branch, which you probably didn’t even think about. In order to elicit such responses, you must broaden the realm of queries so that it covers different responses.

#### Have faith in the process:

AI can often figure things out that it wasn’t trained to understand and we might never comprehend how that happened. This is one of the wonders of AI. For example, Google’s neural network was shown YouTube videos for a few days and it learnt to identify cats, something it wasn’t taught.

Such unprecedented outcomes might be welcome for Google, but most businesses want to trust AI, and for that they seek to know how the technology arrives at its solutions. The insights provided by machine learning are amazing, but businesses can act on them only if they trust the tech. It takes time to trust machines, just as it does with humans. In the beginning we might feel the need to verify outputs, but as the algorithms give good results repeatedly, trust comes naturally.

#### Act sensibly:

Machine learning is a powerful tool that can backfire too. An example of that is the recent misuse of Facebook’s data by Cambridge Analytica, which Facebook authorities couldn’t explain either. Companies need to be aware of the consequences of using such an advanced technology. They need to be mindful of how employees use results generated by analytics tools and how third parties handle data that has been shared. Not all employees need to know that AI is used for internal business processes.

Artificial Intelligence can fuel growth and efficiency for companies, but it takes people to make the best use of it. And how can you take advantage of this data-dominated business world? Enroll for big data Hadoop certification in Gurgaon. As DexLab Analytic’s #BigDataIngestion campaign is ongoing, interested students can enjoy flat 10% discount on big data Hadoop training and data science certifications.

## 7-Step Framework to Ensure Big Data Quality

Ensuring data quality is of paramount importance in today’s data-driven business world because poor quality can render all kinds of data completely useless. Moreover, such data is unreliable and leads to faulty business strategies if analyzed. Data quality is the key to making trustworthy business decisions.

Companies lacking a proper data-quality framework are likely to encounter a crisis situation. According to certain reports, big companies are incurring losses of around \$9 million/year due to poor data quality. Back in 2013, the US Postal Service spent around \$1.5 billion in processing mail that was undeliverable due to bad data quality.

While the sources of poor quality data can be many, including data entry, data processing and stale data, data in motion is the most vulnerable. The moment data enters the systems of an organization it starts to move. There’s a lot of uncertainty about how to monitor moving data, and the existing processes are fragmented and ad-hoc. Data environments are becoming more and more complex, and the volume, variety and speed of big data can be quite overwhelming.

Here, we have listed some essential steps to ensure that your data is consistently of good quality.

• Discover: Systems carrying critical information need to be identified first. For this, source and target system owners must jointly work to discover existing data issues, set quality standards and fix measurement metrics. So, this step ensures that the company has established yardsticks against which the data quality of various systems will be measured. However, this isn’t a one-time process; rather, it is a continuous process that needs to evolve with time.
• Define: It is crucial to clearly define the pain points and potential risks associated with poor data quality. Some of these definitions might be relevant to only one particular organization, whereas many others are associated with regulations of the industry/sector the company belongs to.
• Assessment: Existing data needs to be assessed against different dimensions, such as accuracy, completeness and consistency of key attributes; timeliness of data, etc. Depending upon the data, qualitative or quantitative assessment might be performed. Existing data policies and their adherence to industry guidelines need to be reviewed.
• Measurement Scale: It is important to develop a data measurement scale that can assign numerical values to different attributes. It is better to express definitions using arithmetic values, such as percentages. For example: Instead of categorizing data as good data and bad data, it can be classified as- acceptable data has >95% accuracy.
• Design: Robust management processes need to be designed to address risks identified in the previous steps. The data-quality analysis rules need to apply to all the processes. This is especially important for large data sets, where entire data sets need to be analyzed instead of samples, and in such cases the designed solutions must run on Hadoop.
• Deploy: Set up appropriate controls, with priority given to the riskiest data systems. People executing the controls are as important as the technologies behind them.
• Monitor: Once the controls are set up, the data quality standards determined in the ‘Discover’ step need to be monitored closely. An automated system is best for continuous monitoring as it saves both time and money.
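The Measurement Scale step above suggests expressing quality as a number rather than a good/bad label. A minimal Python sketch of that idea follows; the validity rule (a six-digit postal code), the sample records and the “needs attention” label are assumptions made up for illustration, while the >95% cut-off comes from the example in the list.

```python
def accuracy_pct(records, is_valid):
    """Percentage of records that pass the validity check."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for r in records if is_valid(r)) / len(records)

def classify(pct, threshold=95.0):
    """Label a data set using the '>95% accuracy is acceptable' rule."""
    return "acceptable" if pct > threshold else "needs attention"

def is_six_digit_code(value):
    """Hypothetical validity rule: a postal code made of exactly six digits."""
    return len(value) == 6 and value.isdigit()

# Made-up sample: 97 well-formed codes and 3 malformed ones.
postal_codes = ["110001"] * 97 + ["", "abc", "12"]
pct = accuracy_pct(postal_codes, is_six_digit_code)
print(f"{pct:.1f}% accurate -> {classify(pct)}")
```

The same pair of functions can be reused for other attributes by swapping in a different validity rule and, if needed, a different threshold.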

Thus, achieving high-quality data requires an all-inclusive platform that continuously monitors data, flagging and stopping bad data before it can harm business processes. Hadoop is a popular choice for data quality management across the entire enterprise.

#### DexLab Analytics Presents #BigDataIngestion

If you are looking for big data Hadoop certification in Gurgaon, visit Dexlab Analytics. We are offering flat 10% discount on our big data Hadoop training courses in Gurgaon. Interested students all over India must visit our website for more details. Our professional guidance will prove highly beneficial for all those wanting to build a career in the field of big data analytics.

## How Big Data Is Influencing HR Analytics for Employees and Employers, Both

HR analytics powered by big data is aiding talent management and hiring decisions. According to a 2015 Deloitte report, 35% of the companies surveyed said that they were actively developing data analytics strategies for HR. Moreover, big data analytics isn’t leaving us anytime soon; it’s here to stay for good.

With that in mind, employers are in an awkward position: should they use HR analytics or not? And even if they do use the data, how are they going to do that without violating any HR policies/laws or upsetting the employees?

#### Health Data

While most employers are concerned about healthcare and wellness programs for their employees, many others have started employing HR analytics to evaluate the programs’ effectiveness and address gaps in healthcare coverage, with the aim of improving overall program performance.

Today, data is the lifeblood of IT services. Adequate pools of employee data in conjunction with company data are aiding discoveries of the best benefit package for employees where they get best but affordable care. However, in the process, the employers need to be very careful and sensitive to employee privacy at the same time. During data analysis, the process should appear as if the entire organization is involved in it, instead of focusing on a single employee or sub-groups.

#### Predictive Performance Analytics

For talent management, HR analytics is a saving grace, especially owing to its predictive capabilities. Because of that, more and more employers are deploying this powerful tool to determine future hiring needs and build a strong powerhouse of talent.

Rightfully so, predictive performance analytics use internal employee data to calculate potential employee turnover, but unfortunately, in some absurd cases, the same data can also be used to influence decisions regarding firing and promotion – and that becomes a problem.

Cutting-edge machine learning algorithms predict whether an event is going to happen or not, rather than relying on what employees are doing or saying. Though this comes with its own advantages, it’s better when people make the final decisions based on the data, because people are unpredictable and so are the influencing factors.

#### Burn away irrelevant information

Sometimes, employers end up scrutinizing all the wrong things instead of focusing on the meaningful ones. For example, HR analytics show that employees living close to the office are less likely to leave the office premises early. But, based on this, can we pass over top talent just because they reside a little farther from the office? We can’t, right?!

Hence, the bottom line is, whenever it comes to analyzing data, analysts should always look at the bigger picture rather than stressing minute features, such as which employee is taking more leave, and so on. Stay ahead of the curve by making the most productive decisions for employees as well as the business as a whole.

In the end, the power of data matters. HR analytics help guide the best decisions, but it’s us who are going to make them. We shouldn’t forget that. Use big data analytics responsibly to prevent any kind of mistrust or legal issues on the part of employees, and deploy them in coordination with employee feedback to arrive at the best conclusions.

Those who are inclined towards big data hadoop certification, we’ve some droolworthy news for you! DexLab Analytics, a prominent data science learning platform has launched a new admission drive: #BigDataIngestion on in-demand skills: data science and big data with exclusive 10% discounts for all students. This summer, unfurl your career’s wings of success with DexLab Analytics!

#### Get the details here : www.dexlabanalytics.com/events/dexlab-analytics-presents-bigdataingestion

Reference:

The article has been sourced from https://www.entrepreneur.com/article/271753

## A Comprehensive Guide on Clustering and Its Different Methods

Clustering is used to make sense of large volumes of data, structured or unstructured, by dividing the data into groups. The members of a group are ‘’similar’’ between them and ‘’dissimilar’’ to objects in other groups. The similarity is based on characteristics such as equal distances from a point or people who read the same genre of book. These groups with similar members are called clusters. The various methods of clustering, which we shall be discussing subsequently, help break up data into logical groupings before analyzing the data more deeply.

If the CEO of a company poses a broad question like “Help me understand our customers better so that we can improve marketing strategies”, then the first thing analysts need to do is use clustering methods to group the customers. Clustering has plenty of applications in our daily lives. Some of the domains where clustering is used are:

• Marketing: Used to group customers having similar interests or showing identical behavior from large databases of customer data, which contain information on their past buying activities and properties.
• Libraries: Used to organize books.
• Biology: Used to classify flora and fauna based on their features.
• Medical science: Used for the classification of various diseases.
• City-planning: Used to identify and group houses based on house type, value and geographical location.
• Earthquake studies: Used to cluster observed earthquake epicenters in order to locate dangerous zones.

Clustering can be performed by various methods, as shown in the diagram below:

Fig 1

The two major techniques used to perform clustering are:

• Hierarchical Clustering: Hierarchical clustering seeks to develop a hierarchy of clusters. The two main techniques used for hierarchical clustering are:
1. Agglomerative: This is a “bottom up” approach where each observation is first assigned a cluster of its own, then pairs of clusters are merged as one moves up the hierarchy. The process terminates when only a single cluster is left.
2. Divisive: This is a “top down” approach wherein all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. The process terminates when each observation has been assigned a separate cluster.

Fig 2: Agglomerative clustering follows a bottom-up approach while divisive clustering follows a top-down approach.

• Partitional Clustering: In partitional clustering a set of observations is divided into non-overlapping subsets, such that each observation is in exactly one subset. The main partitional clustering method is K-Means Clustering.
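The bottom-up (agglomerative) procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function names `single_linkage` and `agglomerative` are made up for this example, the sample points are invented, and the choice of single linkage (distance between the closest pair of points) with Euclidean distance is an assumption, since the text does not fix a linkage type here.

```python
import math

def single_linkage(c1, c2):
    """Distance between two clusters, taken as the closest pair of points.
    Single linkage is an assumed choice; other linkage types are possible."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points):
    """Repeatedly merge the two closest clusters until one cluster is left.
    Returns the merges in the order they happened."""
    clusters = [[p] for p in points]  # each observation starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]  # made-up sample data
for left, right in agglomerative(points):
    print(left, "+", right)
```

With n observations the loop always performs n - 1 merges, which matches the description of the process ending when a single cluster is left. A divisive method would run the other way, starting from one cluster and splitting it.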

The most popular criterion used for forming clusters or deciding the closeness of clusters is distance. There are various distance measures. The same distance measure is applied to all observations, and an observation is assigned to the cluster from which its distance is smallest. The different distance measures are:

• Euclidean Distance: This is the most common distance measure of all. It is given by the formula:

Distance((x, y), (a, b)) = √((x – a)² + (y – b)²)

For example, the Euclidean distance between points (2, -1) and (-2, 2) is found to be

Distance((2, -1), (-2, 2)) = √((2 – (-2))² + (-1 – 2)²) = √(16 + 9) = 5
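As a quick check of the Euclidean formula, here is a small Python sketch. The function name `euclidean_distance` is chosen for this example only, and the points are the ones used in the example above.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square the coordinate differences, add them up, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((2, -1), (-2, 2)))  # 5.0
```

The same function works for points with more than two coordinates, since it simply sums over every coordinate pair.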

• Manhattan Distance:

This gives the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), Manhattan distance is |x1 – x2| + |y1 – y2|.
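The Manhattan distance can be sketched the same way. The function name `manhattan_distance` is illustrative, and the sample points are reused from the Euclidean example purely for comparison.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))

# |2 - (-2)| + |-1 - 2| = 4 + 3
print(manhattan_distance((2, -1), (-2, 2)))  # 7
```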

• Hamming Distance:

Hamming distance between two vectors is the number of bits we must change to convert one into the other. For example, to find the distance between vectors 01101010 and 11011011, we observe that they differ in 4 places. So, the Hamming distance d(01101010, 11011011) = 4
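The count above can be verified with a short Python sketch. Representing the bit vectors as strings and the name `hamming_distance` are assumptions made for this example.

```python
def hamming_distance(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    # True counts as 1 when summed, so this counts the mismatching positions.
    return sum(a != b for a, b in zip(u, v))

print(hamming_distance("01101010", "11011011"))  # 4
```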

• Minkowski Distance:

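The Minkowski distance of order p between two points X = (x1, …, xn) and Y = (y1, …, yn) is defined as

$$D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p}$$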

The case where p = 1 is equivalent to the Manhattan distance and the case where p = 2 is equivalent to the Euclidean distance.
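The two special cases can be checked numerically. The sketch below assumes the standard definition of the Minkowski distance (the p-th root of the sum of absolute coordinate differences raised to the power p); the function name `minkowski_distance` and the sample points are illustrative.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (2, -1), (-2, 2)
print(minkowski_distance(x, y, 1))  # same value as the Manhattan distance
print(minkowski_distance(x, y, 2))  # same value as the Euclidean distance
```

For these points the first call gives 7 (the Manhattan distance) and the second gives 5 (the Euclidean distance), up to floating-point rounding.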

These distance measures are used to measure the closeness of clusters in hierarchical clustering.

In the next blogs, we will discuss the different methods of clustering in more detail, so make sure you follow DexLab Analytics. We provide the best big data Hadoop certification in Gurgaon. Do check out our data analyst courses in Gurgaon.