Big Data Archives - Page 17 of 17 - DexLab Analytics | Big Data Hadoop SAS R Analytics Predictive Modeling & Excel VBA

How Hadoop makes Optimum Use of Distributed Storage and Parallel Computing

Hadoop is a Java-based open source framework from the Apache Software Foundation. It works on the principle of distributed storage and parallel computing for large datasets on commodity hardware.

Let’s look at a few core concepts of Hadoop in detail:

Distributed Storage – In Hadoop we deal with files whose size runs into terabytes, or even petabytes. Each file is divided into parts (blocks) and stored across multiple machines. Hadoop replicates each block 3 times by default (you can change the replication factor as per your requirement); keeping 3 copies minimizes the risk of data loss in the Hadoop ecosystem, much as you keep a spare car key at home to avoid trouble in case your keys are lost.
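The idea can be sketched in a few lines of Python. This is only an illustration, not how HDFS actually places replicas (the real NameNode uses rack-aware placement); the 64 MB block size, the node names and the round-robin policy here are assumptions made for the example.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default block size
REPLICATION = 3                 # default replication factor

def place_blocks(file_size, machines):
    """Split a file into blocks and assign each block to
    REPLICATION distinct machines, round-robin style."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    ring = itertools.cycle(machines)
    placement = {}
    for block_id in range(n_blocks):
        placement[block_id] = [next(ring) for _ in range(REPLICATION)]
    return placement

# A 1 GB file on a hypothetical 5-node cluster:
plan = place_blocks(1024 * 1024 * 1024,
                    ["node1", "node2", "node3", "node4", "node5"])
print(len(plan))   # 16 blocks of 64 MB
print(plan[0])     # ['node1', 'node2', 'node3']
```

For a 1 GB file on a 5-node cluster this yields 16 blocks, each stored on 3 distinct nodes, so losing any single machine still leaves at least two copies of every block.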


Parallel Processing – We have progressed a lot in terms of storage space and processor power, but hard-disk seek time has not improved significantly. Because of this, reading a 1 TB file from a single disk would take a very long time. In Hadoop, by storing the file across 10 machines in a cluster, we can cut the effective seek time by up to 10 times.
HDFS also uses a large default block size of 64 MB to store large files in an optimized manner.

Let me explain with some calculations:

| | Traditional System (Windows) | Hadoop System (HDFS) |
|---|---|---|
| File size | 1 TB (1,000,000,000 KB) | 1 TB (1,000,000,000 KB) |
| Block size | 8 KB | 64 MB |
| No. of blocks | 125,000,000 (1,000,000,000 / 8) | 15,625 (1,000,000,000 / 64,000) |
| Assumed avg. seek time | 4 ms | 4 ms |
| Total seek time | 125,000,000 × 4 = 500,000,000 ms | 15,625 × 4 = 62,500 ms |

As you can see, thanks to the HDFS block size of 64 MB we save 499,937,500 ms (i.e. 99.98% of the seek time) while reading a 1 TB file, in comparison to a Windows system.

We could further reduce read time by dividing the file into n parts and saving them on n machines; the total seek time for a 1 TB file would then be 62500/n ms.
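The arithmetic above is easy to verify in a few lines of Python; the file size, block sizes and 4 ms average seek time are the same assumptions used in the calculation.

```python
FILE_SIZE_KB = 1_000_000_000   # 1 TB expressed in KB, as above
SEEK_TIME_MS = 4               # assumed average seek time per block

def total_seek_ms(block_size_kb, machines=1):
    """Total seek time to read the whole file, optionally spread over n machines."""
    blocks = FILE_SIZE_KB // block_size_kb
    return blocks * SEEK_TIME_MS // machines

print(total_seek_ms(8))           # 8 KB blocks            -> 500000000 ms
print(total_seek_ms(64_000))      # 64 MB blocks           -> 62500 ms
print(total_seek_ms(64_000, 10))  # 64 MB on 10 machines   -> 6250 ms
```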

Here you can see one use of parallel processing, i.e. parallel reading of a file on multiple machines in a cluster.
Parallel processing is also the concept on which the MapReduce paradigm works in Hadoop: it distributes a job into multiple tasks that are processed as a MapReduce job. More details will follow in an upcoming blog on MapReduce.
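A MapReduce job can be illustrated with the classic word-count example. The sketch below runs everything in a single Python process purely to show the map → shuffle → reduce flow; in real Hadoop the map and reduce tasks run in parallel on different machines of the cluster (and are typically written in Java).

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in its input split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return key, sum(values)

# Two "input splits", as if they lived on two different machines:
splits = ["big data is big", "data is everywhere"]
mapped = [pair for line in splits for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```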

Commodity Hardware – This is the ordinary hardware you use in your laptops and desktops, in place of highly available, reliable enterprise machines such as IBM servers. The use of commodity hardware has helped businesses save a lot of infrastructure cost; commodity hardware is approximately 60% cheaper than a comparable high-availability machine.

Big Data – Down to the Tidbits

Any data that is difficult to process or store in real time on conventional systems, given their computational power and storage capacity, is known as Big Data. In our times the growth of data to be stored is exponential, and so is the number of its sources.

Big Data has some other distinguishing features, popularly known as the V’s of Big Data. In no particular order, they are:

  • Variable: To illustrate the variable nature of Big Data, consider an analogy: a single item ordered from a restaurant may taste different at different times. Variability in Big Data refers to context, since similar text may have different meanings depending on the context in which it appears. Figuring out and differentiating between meanings according to context remains a long-standing challenge for algorithms.
  • Volume: The volume of data, growing exponentially in today’s times, presents the biggest hurdle for traditional systems of processing and storage. This growth remains very high and is usually measured in petabytes (thousands of terabytes).
  • Velocity: Data generated in real time by logs and sensors is time-sensitive and arrives at high rates. It needs to be processed in real time so that decisions can be made as and when necessary. For instance, particular credit card transactions are assessed, and approved or declined, in real time. The banking industry is thus able to better understand consumer patterns and make safer, more informed decisions on transactions with the help of Big Data.


  • Volatile: Another factor to keep in mind when dealing with Big Data is how long particular data remains valid and is useful enough to be stored; this is dictated by the importance of the data. A practical example: a bank might decide that certain data about a particular credit-card holder’s credibility is no longer useful. At the same time, it is imperative that business is not lost while trying to avoid poor business propositions.
  • Variety: Variety refers to the varied sources of data and to whether the data is structured or not. Data may arrive in many formats, such as videos, images, XML files or logs. Unstructured data is difficult to analyze as well as store in traditional computing systems.

Most major organizations around the world are now looking to manage, store and process their Big Data on more economical and feasible platforms, so that effective analysis and decision-making are possible.

Apache Hadoop is the current market leader and allows for a smooth transition. However, with the rise of Big Data there has been a marked increase in demand for trained professionals who can develop applications on Hadoop or create new data architectures. The distributed model of storage and processing pursued by Hadoop gives it a clear advantage over conventional database management systems.

THE BIGGER THE BETTER – BIG DATA

One fine day people realized it was raining gems and diamonds from the sky, and they started looking for a huge container to collect and store it all. But even the biggest physical container was not enough, since it was raining everywhere, all the time, and no one could have all of it alone. So they decided to simply collect it in their regular containers, and then share and use it.

Over the last few years, and especially since the introduction of hand-held devices, valuable data has been generated all around us: from health care companies, weather information for the entire world, GPS, telecommunications, stock exchanges, financial data, satellites and aircraft, to the social networking sites that are all the rage these days. We are generating almost 1.35 million GB of data every minute. This huge amount of valuable, varied data being generated at very high speed is termed “Big Data”.


This data is of interest to many companies, as it provides a statistical advantage in predicting sales, health epidemics, climatic changes, economic forecasts and so on. With the help of Big Data, health care providers can detect an outbreak of flu just from the number of people in a geography writing on social media sites, “not feeling well.. down with cold!”.

Big Data was used in the search for the missing Malaysian flight “MH370”. It was Big Data that helped analyze the millions of responses to, and the impact of, the very famous TV show “Satyamev Jayate”. Big Data techniques are also being used in neonatal units to record and analyze the breathing patterns and heartbeats of babies, predicting infections even before the symptoms appear.

As they say, when you have a really big hammer, everything becomes a nail. There is hardly a field where Big Data does not give you an edge. However, processing this massive amount of data is a challenge, hence the need for a framework that can store and process data in a distributed manner (the shared regular containers).

Apache Hadoop is an open source framework, developed by Doug Cutting and Mike Cafarella in 2005 and written in Java, for distributed processing and storage of very large data sets on clusters of ordinary commodity hardware.

It uses data replication for reliability, and block locations are tracked centrally (by the NameNode) for fast retrieval of data. Hadoop has HDFS (Hadoop Distributed File System) for the storage of data and MapReduce for parallel processing of this distributed data. To top it all, it is cost effective, since it uses commodity hardware only, and is scalable to whatever extent you require. The Hadoop framework is in huge demand among all big companies. It is the handle for the big hammer!

Big Data at Autodesk: 360 Degree view of Customers in the Cloud

The last few years have seen a huge paradigm shift for many software vendors. The move away from a product-based model towards software-as-a-service (SaaS) in the cloud has brought huge changes. The main advantage of this move is that companies can now identify how and why a product is being used from the service usage itself. Earlier, software companies would run surveys or customer focus groups to find out how and why a product was being used, but such feedback surveys have various limitations in identifying product usage or where product improvements need to be made.


Autodesk was one of the frontrunners in the field, having experimented with cloud-based SaaS as far back as 2001, when it acquired the BUZZSAW file sharing and synchronization service. Since then Microsoft, Adobe and many others have moved to subscription-based, on-demand services, and Autodesk has done the same with its core computer-aided design products.

Software-as-a-service is a software licensing and delivery model in which software is licensed on a subscription basis and is centrally hosted; it is sometimes referred to as on-demand software. On-premise software is the exact opposite, where the product is deployed inside the particular organization’s own infrastructure.

Understanding how customers use a product is critical to giving them what they want. In a SaaS environment, where everything happens online and in the cloud, companies can gain a far more accurate picture of how their products are actually being used.

Moving to a cloud-based subscription model lets the business understand more about customers’ product usage, which gives it the edge to serve customers better. This shift in the industry should not be ignored: Big Data is now genuinely being used to understand how and where to improve the product.

 

The Indian IT industry is focusing mainly on the Cloud, Analytics, Mobile and Social segments to drive further growth. The software-as-a-service delivery model can certainly give businesses the edge to analyze where and how a product is used.

 

 

There are a number of reasons why software-as-a-service is beneficial to organizations:

 

  • No additional hardware costs: you can buy processing power or hardware as per your requirement, with no need for a high-end configuration up front. The subscription is need-based.
  • Usage is scalable: you can scale up whenever you require.
  • Applications can be customised.
  • Accessible from any location: rather than being restricted to installations on individual computers, an application can be accessed from anywhere with an internet-enabled device.

 

The adoption of the cloud-based delivery model is accelerating, mainly because of the analytical capability it gives businesses to understand their customers. Analytics rocks!

 

For state-of-the-art Big Data training in Pune, look no further than DexLab Analytics. It is a renowned institute that excels in Big Data Hadoop certification in Pune. For more information, visit their official site.

 


How Vital Is It to Measure KPIs for Future Success


As I discussed earlier, analytics is highly quantitative in nature. In this blog, we will discuss the importance of Key Performance Indicators and how KPIs help in measuring an organization’s performance.

Continue reading “How Vital Is It to Measure KPIs for Future Success”

How is big data analytics a strategic part of business decisions today?

(Decode big data analytics and its relevance in today’s business scenario)


Businesses are becoming increasingly competitive, thanks to a community connected through multiple devices. The way people shop or look for products and solutions has undergone a paradigm shift. We are living in a changed world in terms of the way technology intersects almost all aspects of business, regardless of size and scale.

Continue reading “How is big data analytics a strategic part of business decisions today?”
