Apache Spark Archives - Page 2 of 2 - DexLab Analytics | Big Data Hadoop SAS R Analytics Predictive Modeling & Excel VBA

The Success Story of Big Data Tooling

The world of Hadoop data tooling is flourishing. Hadoop is said to be shifting from a possible data warehouse replacement to a full-fledged big data analytics platform.

Back in the day, soon after Hadoop first took shape at Yahoo, proponents of big data asserted its potential to replace enterprise data warehouses built around business intelligence.

Open source Hadoop data tooling became a preferred choice mainly as an alternative to expensive existing systems. Over time, the focus shifted to extending existing data warehouses and beyond. Complex Hadoop deployments today are known as data lakes, and of late big data tooling has been growing well beyond simple data warehousing.

“We are seeing increasing capabilities on the Hadoop and open source side to take over more and more of the corporation’s data and workloads, including BI,” said Mike Matchett, an analyst and founder of the Small World Big Data consultancy.


Self Service and Big Data

In August, Cloudera launched Workload XM, a management service designed for cloud-based analytics. The company also released a hybrid Cloudera Data Warehouse and a Cloudera Altus Data Warehouse, capable of running on both Microsoft Azure and AWS.

The main objective of the management service is to provide visibility into various data workloads. Workload XM is designed to help administrators deliver reliable service-level agreements for self-service analytics applications, says Anupam Singh, GM of Analytics at Cloudera, Palo Alto, Calif.

Importantly, Singh also mentioned that the cloud warehouse offers encryption for data both at rest and in motion, and provides a better view into the trajectory of data sets in analytics workloads. Such capabilities have gained importance with GDPR and similar regulations.

However, all these discussions boil down to one point, which is how to increase the use of big data analytics. “Customers don’t look at buzzwords like Hadoop and cloud. But they do want more business units to access the data,” he added.

Data on the Wheels

Hortonworks, another Hadoop player, is a cloud aficionado. In June, the company broadened its Google Cloud presence with Google Cloud Storage support. Enhancing real-time data analytics and management is a priority.

Meanwhile, in August, Hortonworks released Streams Messaging Manager (SMM) to manage data streaming and give administrators comprehensive views into Kafka messaging clusters, which have become increasingly popular in big data pipelines.

These management tools are crucial for moving Hadoop-based big data analytics into production for workloads where data warehouses fall short, such as recommendation engines and fraud detection.

Meanwhile, Kafka-related capabilities in SMM continue to advance, and the recently released Hortonworks DataFlow 3.2 improved data streaming performance.

MapR Adaptability

Similar to its competitors, MapR has extended its capabilities beyond its original role as a mere data warehouse replacement. Early this year, the company released a new version of its MapR Data Platform equipped with better streaming data analytics and new item data services that work in the cloud as well as on premises.

As a final thought, the horizon of Hadoop is expanding while data tooling keeps changing. However, unlike before, Hadoop is no longer the sole choice for data analytics. The options now include Apache Spark and machine learning frameworks, all of which are effective when put to use.

If you are looking for Apache Spark Certification, drop by DexLab Analytics. Their Apache Spark Training program is extremely well-crafted and in sync with industry demands. For more, visit the site.


The article has been sourced from — searchdatamanagement.techtarget.com/news/252448331/Big-data-tooling-rolls-with-the-changing-seas-of-analytics


Interested in a career in Data Analyst?

To learn more about Data Analyst with Advanced excel course – Enrol Now.
To learn more about Data Analyst with R Course – Enrol Now.
To learn more about Big Data Course – Enrol Now.

To learn more about Machine Learning Using Python and Spark – Enrol Now.
To learn more about Data Analyst with SAS Course – Enrol Now.
To learn more about Data Analyst with Apache Spark Course – Enrol Now.
To learn more about Data Analyst with Market Risk Analytics and Modelling Course – Enrol Now.

DexLab Analytics’ AUGUST OFFER: Everything You Need to Know Of

We are happy to announce some good news: DexLab Analytics is all set to launch exhaustive modules in Deep Learning with AI starting with Artificial Neural Networks using Python, MS Excel, Dashboards, VBA Macros, Tableau BI, Visualization and Python Spark for Big Data from September 1, 2018. The course modules cover in-demand skills that are taking the world by storm.

Big data, data science and artificial intelligence are buzzwords these days. More and more people are coming forward and showing keen interest in these disciplines, which solve real-world problems. This is why we didn’t want to fall behind. We understand the importance of data in this digitized world, and have accordingly chalked out our intensive, industry-ready courses.

The Deep Learning and AI starting with Artificial Neural Networks using Python course module is a 30-hour training program that gives exposure to MLP, CNN, RNN, LSTM, Theano, TensorFlow and Keras. It includes more than 8 projects, a couple of which focus on developing models for image and text recognition. The MS Excel, Dashboards and VBA Macros certification is curated by expert consultants who combine industry expertise with academic knowledge. The course runs for a total of 24 hours and is conducted by seasoned professionals with more than 8 years of industry experience in this field.

Next, we have 30-hour hands-on classroom training for the Tableau BI & Visualization certification, which teaches young minds how graphical representation of data reveals a company’s future trends and enables quicker decisions. Tableau is one of the fastest evolving BI and data visualization tools. With that in mind, we offer students a learning path built on a structured approach, an easy learning methodology and a well-planned course curriculum.

Lastly, our Big Data with PySpark certification is another feather in the learner’s cap: the Spark Python API (PySpark) exposes users to the Spark programming model with Python. Apache Spark is open source and is touted as a significant big data framework for running your tasks across a cluster. The main objective of this course is to teach budding programmers how to write Python code using the map-reduce programming model. The 40-hour hands-on classroom training covers Big Data, an overview of Hadoop, Python, Apache Spark, Kafka, PySpark and Machine Learning.

Now, the first 12 students who register for each course on or before 30th August, 2018 will get an attractive discount on the total course fee. Interesting, isn’t it? So, what are you waiting for? Grab all the details about the AUGUST OFFER: to register, call us at +91 9315 725 902 / +91 124 450 2444 or visit www.dexlabanalytics.com/contact


An ABC of Apache Spark Streaming

Apache Spark has become one of the most popular technologies. It comes with a powerful streaming library, which has quite a few advantages over other technologies. The integration of Spark Streaming APIs with Spark core APIs provides a dual-purpose real-time and batch analytical platform. Spark Streaming can also be combined with Spark SQL, MLlib and GraphX when complex cases need to be handled. Well-known organizations that use Spark Streaming include Netflix, Uber and Pinterest. Spark Streaming’s fame in the world of data analytics can be attributed to its fault tolerance, ability to process live streams, scalability and high throughput.


Need for Streaming Analytics:

Companies generate enormous amounts of data on a daily basis. Transactions happening over the internet, social network platforms, IoT devices and so on generate large volumes of data that need to be leveraged in real time, and this will only gain importance in the future. Entrepreneurs consider real-time data analysis a great opportunity to scale up their businesses.

Spark Streaming takes in live data streams and divides them into batches; the Spark engine then processes these batches and emits the results in batches as well.

Architecture of Spark Streaming:

Spark Streaming breaks the data stream into micro-batches (an approach known as discretized stream processing). First, receivers accept data in parallel and buffer it on worker nodes. Then the engine runs short tasks to process the batches and sends the results to other systems.
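The micro-batching idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark Streaming API: the names micro_batches and process are hypothetical, events are assumed to arrive in time order, and the batch interval is a made-up one second.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[str]]:
    """Group a time-ordered event stream into fixed-interval batches."""
    batch: List[str] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + interval
        # Close every window that ended before this event arrived,
        # emitting an empty batch for intervals with no data.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += interval
        batch.append(payload)
    if batch:
        yield batch


def process(batch: List[str]) -> int:
    """Stand-in for a short engine task: here it only counts the records."""
    return len(batch)


stream = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (1.2, "d"), (3.5, "e")]
batches = list(micro_batches(stream, interval=1.0))
results = [process(b) for b in batches]
print(batches)  # [['a', 'b'], ['c', 'd'], [], ['e']]
print(results)  # [2, 2, 0, 1]
```

Each batch is an ordinary list, so the function that processes it is the same kind of code one would write for a batch job.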

Spark tasks are allocated to workers dynamically, depending on the resources available and the locality of the data. The advantages of Spark Streaming are many, including better load balancing and speedy fault recovery. The resilient distributed dataset (RDD) is the basic abstraction behind its fault-tolerant datasets.

Useful features of Spark streaming:

Easy to use: Spark streaming supports Java, Scala and Python and uses the language integrated API of Apache Spark for stream processing. Stream jobs can be written in a similar manner in which batch jobs are written.

Spark integration: Since Spark Streaming runs on Spark, it can be used for answering ad hoc queries and for reusing the same code across batch and streaming jobs. Robust interactive applications can also be built.

Fault tolerance: Work that has been lost can be recovered without additional coding from the developer.

Benefits of discretized stream processing:

Load balancing: In Spark Streaming, the job load is balanced across workers. While some workers handle more time-consuming tasks, others process tasks that take less time. This is an improvement over traditional approaches in which work is statically assigned to nodes, where a time-consuming task behaves like a bottleneck and delays the whole pipeline.
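A small simulation makes the load-balancing point concrete. It is a simplified sketch, not Spark's scheduler: the function names and task durations are invented for illustration, and it compares only the total finishing time (makespan) of a fixed round-robin assignment against one where any idle worker takes the next task.

```python
import heapq
from typing import List


def static_makespan(durations: List[int], workers: int) -> int:
    """Each worker is pre-assigned a fixed slice of the tasks (round-robin)."""
    loads = [0] * workers
    for i, d in enumerate(durations):
        loads[i % workers] += d
    return max(loads)


def dynamic_makespan(durations: List[int], workers: int) -> int:
    """Whichever worker becomes idle first takes the next task."""
    free_at = [0] * workers  # min-heap of the times at which workers become idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


tasks = [8, 1, 1, 1, 8, 1, 1, 1]  # two slow tasks among many fast ones
print(static_makespan(tasks, workers=2))   # 18
print(dynamic_makespan(tasks, workers=2))  # 11
```

With the static split, both slow tasks land on the same worker, which becomes the bottleneck; with dynamic assignment the second worker absorbs the short tasks while the first is busy.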

Fast recovery: In traditional systems, when a node fails, the failed operators need to be restarted on a different node. Recomputing the lost information involves rerunning a portion of the data stream, so the pipeline is halted until the new node catches up after the rerun. In Spark, things work differently. Failed tasks can be restarted in parallel and the recomputations are distributed evenly across different nodes. Hence, recovery is much faster.

Spark streaming use cases:

Uber: Uber collects gigantic amounts of unstructured data from mobile users on a daily basis. This is converted to structured data and sent for real-time telemetry analysis. The data is analyzed in an ETL pipeline built using Spark Streaming, Kafka and HDFS.

Pinterest: To understand how Pinterest users are engaging with pins globally, it uses an ETL data pipeline to provide information to Spark through Spark streaming. Hence, Pinterest aces the game of showing related pins to people and providing relevant recommendations.

Netflix: Netflix relies on Spark streaming and Kafka to provide real-time movie recommendations to users.

The Apache Software Foundation is home to technologies such as Spark and Hadoop. For performing real-time analytics, Spark Streaming is one of the best options.

As businesses are swiftly embracing Apache Spark with all its perks, you as a professional might be wondering how to gain proficiency in this promising tech. DexLab Analytics, one of the leading Apache Spark training institutes in Gurgaon, offers expert guidance that is sure to make you industry-ready. To know more about Apache Spark certification courses, visit Dexlab’s website.

This article has been sourced from: https://intellipaat.com/blog/a-guide-to-apache-spark-streaming-tutorial


A Comprehensive Article on Apache Spark: the Leading Big Data Analytics Platform

Speedy, flexible and user-friendly, Apache Spark is one of the main distributed processing frameworks for big data in the world. It was developed by a team of researchers at U.C. Berkeley in 2009, with the aim of speeding up processing in Hadoop systems. Spark provides bindings to programming languages like Java, Scala, Python and R, and is a leading platform that supports SQL, machine learning, stream and graph processing. It is extensively used by tech giants like Apple, Microsoft and IBM, as well as by telecommunications and gaming companies.

Databricks, the firm where the creators of Apache Spark now work, provides the Databricks Unified Analytics Platform, a service that includes Apache Spark clusters, streaming and web-based notebook development. To operate in standalone cluster mode, one needs the Apache Spark framework and a JVM on each machine in the cluster. To reap the advantages of a resource management system, running on Hadoop YARN is the usual choice. Amazon EMR and Google Cloud Dataproc are fully managed cloud services for running Apache Spark.


Working of Apache Spark:

Apache Spark has the power to process data from a variety of data storehouses, such as Hadoop Distributed File System (HDFS) and NoSQL databases. It is a platform that enhances the functioning of big data analytics applications through in-memory processing. It is also equipped to carry out regular disk-based processing in case of large data sets that are unable to fit into system memory.

Spark Core:

The Apache Spark API (Application Programming Interface) is more developer-friendly than MapReduce, the software framework used by earlier versions of Hadoop. It hides much of the complicated processing from developers; for example, roughly 50 lines of MapReduce code for counting words in a file can be reduced to only a few lines in Apache Spark. Bindings to popular programming languages, like R and Java, make Apache Spark accessible to a wide range of users, including application developers and data analysts.
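To show the shape of the word-count computation mentioned here, the sketch below runs the map and reduce phases in plain Python on a single machine. It is not Spark code and the function names are hypothetical; in PySpark the same logic is usually expressed by chaining the RDD operations flatMap, map and reduceByKey.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the counts that share the same key."""
    totals: Dict[str, int] = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


lines = ["spark makes big data simple", "big data needs big tools"]
counts = reduce_phase(map_phase(lines))
print(counts["big"], counts["data"])  # 3 2
```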

Spark RDD:

A Resilient Distributed Dataset (RDD) is a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. For fast processing, RDD operations are split across the cluster and executed in parallel. A driver process divides a Spark application into jobs and distributes the work among different executor processes. The Spark Core API is built on the RDD concept, which supports functions like merging, filtering and aggregating data sets. RDDs can be created from SQL databases, NoSQL stores and text files.
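The two properties described above, immutability and partitioning, can be illustrated with a toy class. MiniRDD is a hypothetical stand-in written for this example and is not part of Spark; it runs on one machine and only mimics how transformations return a new dataset while each partition is handled independently.

```python
import functools
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class MiniRDD:
    """Toy stand-in for an RDD: immutable and split into partitions."""
    partitions: Tuple[Tuple[int, ...], ...]

    def map(self, fn: Callable[[int], int]) -> "MiniRDD":
        # Each partition is transformed on its own, so a cluster could do this in parallel.
        return MiniRDD(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def filter(self, pred: Callable[[int], bool]) -> "MiniRDD":
        return MiniRDD(tuple(tuple(x for x in part if pred(x)) for part in self.partitions))

    def reduce(self, fn: Callable[[int, int], int]) -> int:
        # Aggregate inside each non-empty partition first, then combine the partial results.
        partials = [functools.reduce(fn, part) for part in self.partitions if part]
        return functools.reduce(fn, partials)


source = MiniRDD(((1, 2, 3), (4, 5, 6)))
squares = source.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)
total = evens.reduce(lambda a, b: a + b)
print(evens.partitions)   # ((4,), (16, 36))
print(total)              # 56
print(source.partitions)  # ((1, 2, 3), (4, 5, 6)), the original is untouched
```

Because every transformation produces a new object from its parent, a lost partition can in principle be rebuilt by rerunning the same functions on the parent's data, which is the idea behind RDD fault tolerance.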

Apart from Spark Core engine, Apache Spark API includes libraries that are applied in data analytics. These libraries are:

  • Spark SQL:

Spark SQL is the most commonly used interface for developing applications. Its data frame approach, similar to data frames in R and Python, is used for processing structured and semi-structured data, while a SQL:2003-compliant interface is available for querying data. It supports reading from and writing to other data stores and formats, like JSON, HDFS and Apache Hive. Spark’s query optimizer, Catalyst, inspects data and queries and then produces a query plan that performs calculations across the cluster.

  • Spark MLlib:

Apache Spark has libraries for applying machine learning techniques and statistical operations to data. Spark MLlib allows easy feature extraction, selection and transformation on structured datasets; it includes distributed implementations of clustering and classification algorithms, such as k-means clustering and random forests.

  • Spark GraphX:

This is a distributed graph processing framework that is based on RDDs. Because RDDs are immutable, GraphX is not well suited to graphs that need to be updated, although the separate GraphFrames package supports graph operations on data frames. GraphX offers two types of APIs, a Pregel abstraction and a MapReduce-style API, which help execute parallel algorithms.

  • Spark Streaming:

Spark streaming was added to Apache Spark to help real-time processing and perform streaming analytics. It breaks down streams of data into mini-batches and performs RDD transformations on them. This design facilitates the set of codes written for batch analytics to be used in stream analytics.
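For readers unfamiliar with the k-means clustering mentioned under Spark MLlib, here is a minimal single-machine sketch of the algorithm on one-dimensional data. It is not MLlib's distributed implementation; the function name, the sample points and the starting centroids are assumptions chosen for illustration.

```python
from typing import List


def kmeans_1d(points: List[float], centroids: List[float], iterations: int = 10) -> List[float]:
    """Lloyd's algorithm on one-dimensional data with caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters: List[List[float]] = [[] for _ in centroids]
        # Assignment step: attach each point to its nearest centroid.
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids


points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers = kmeans_1d(points, centroids=[0.0, 5.0])
print(centers)  # one centroid near 1.0 and one near 9.53
```

A distributed version follows the same two steps, with the assignment step running on each partition of the data and the partial sums combined to update the centroids.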

Future of Apache Spark:

The pipeline structure of MLlib allows constructing classifiers with a few lines of code and applying TensorFlow graphs and Keras models to data. The Apache Spark team is working to improve streaming performance and facilitate deep learning pipelines.

To learn how to create data pipelines and cutting-edge machine learning models, join the Apache Spark programming training in Gurgaon at DexLab Analytics. Our experienced consultants ensure that you receive the best Apache Spark certification training.


Foster your Machine Learning Efforts with these 5 Best Open Source Frameworks

Machine Learning is rapidly becoming mainstream and changing the way we carry out tasks. While many factors have contributed to the current boom in machine learning, the most important reason is the wide availability of open source frameworks.

‘Open source’ refers to a program that is created as a collaborative effort in which programmers improve the code and share the changes within the community. Open source sprouted in the technological community in response to proprietary software owned by corporations. The rationale for this movement is that programmers not concerned with proprietary ownership or financial gain will produce a more useful product for everyone to use.

Framework: It refers to a collection of programs, libraries and languages that have been built for use in application development. The key difference between a library and a framework is “inversion of control”. When a method is called from a library, the user is in control. With a framework the control is inverted: the framework calls the user’s code.
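The difference is easy to see in a few lines of Python. In this sketch, word_lengths plays the role of a library function and MiniFramework is a hypothetical toy framework invented for the example; neither is modeled on any real product.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str], str]


# Library style: our code decides when to call the helper.
def word_lengths(words: List[str]) -> List[int]:
    return [len(w) for w in words]


# Framework style: we register handlers and the framework decides when to call them.
class MiniFramework:
    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, event: str) -> Callable[[Handler], Handler]:
        def register(handler: Handler) -> Handler:
            self._handlers[event] = handler
            return handler
        return register

    def run(self, events: List[Tuple[str, str]]) -> List[str]:
        # The framework owns the loop and calls back into user code;
        # events without a registered handler are skipped.
        return [self._handlers[name](payload)
                for name, payload in events if name in self._handlers]


app = MiniFramework()


@app.on("greet")
def greet(name: str) -> str:
    return f"hello, {name}"


lengths = word_lengths(["spark", "hadoop"])              # we call the library
responses = app.run([("greet", "Ada"), ("other", "x")])  # the framework calls us
print(lengths)    # [5, 6]
print(responses)  # ['hello, Ada']
```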

If you are plunging full-fledged into machine learning, then you clearly need relevant resources for guidance. Here are the top 5 frameworks to get you started.

  1. TensorFlow:

TensorFlow was developed by the Google Brain Team for handling perceptual and language-understanding tasks. It is used for research on machine learning and deep neural networks, and it offers a Python-based interface. It powers a variety of Google products, including speech recognition, Gmail, Photos and Search.

A nifty feature of this framework is that it expresses complex mathematical computations as data flow graphs. TensorFlow grants users the flexibility to write their own libraries as well. It is also portable: it can run in the cloud and on mobile computing platforms, on both CPUs and GPUs.

  2. Amazon Machine Learning (AML):

AML comes with a plethora of tools and wizards that help create machine learning models without having to delve into the intricacies of machine learning, which makes it a great choice for developers. AML users can generate predictions and utilize data services from the data warehouse platform, Amazon Redshift. AML provides visualization tools and wizards that guide developers. Once the machine learning models are ready, AML makes it easy to obtain predictions using simple APIs.

  3. Shogun:

Abundant in state-of-the-art algorithms, Shogun makes for a very handy tool. It is written in C++ and provides data structures for machine learning problems. It can run on Windows, Linux and macOS. Shogun also proves very helpful as it integrates with other machine learning libraries like SVMLight, LibSVM, libqp, SLEP, LibLinear, VowpalWabbit and Tapkee, to name a few.

  4. Accord.NET:

Accord.NET is a machine learning framework with multiple libraries that deal with everything from pattern recognition and image and signal processing to linear algebra, statistical data processing and much more. What makes Accord so valuable is the breadth of what it offers, which includes 40 different statistical distributions, more than 30 hypothesis tests, and more than 38 kernel functions.

  5. Apache SINGA, Apache Spark MLlib, and Apache Mahout:

These three frameworks have plenty to offer. Apache SINGA is widely used in natural language processing and image recognition. It is also adept at running on a varied collection of hardware.

Mahout provides Java libraries for a wide range of mathematical operations. Spark MLlib was built with the aim of making machine learning easy. It unites numerous learning algorithms and utilities, including classification, clustering, dimensionality reduction and many more.

With the advent of open source frameworks, companies can work with developers for improved ideas and superior products. Open source presents the opportunity to accelerate the process of software development and meet the demands of the marketplace.

Boost your machine learning endeavors by enrolling for the Apache Spark training course at DexLab Analytics where experienced professionals ensure that you become proficient in the field of machine learning.


How India is driving towards Data Governance

Data is power. It is the key to proper planning, governance, policy decisions and empowering communities. In recent times, technological expansion has been contributing immensely towards ensuring a sustainable future and building a promising IT base. Robust developments in IT-related services have resulted in key breakthroughs, including Big Data, which in turn have enabled smoother data governance.

According to a NASSCOM report, India’s analytics market is expected to grow from $1 billion to $2.3 billion in the year 2017-18. However, the fuller benefits of data analytics are yet to be channelized by the public sector.

In a diverse country like India, data collection is a lengthy procedure. At present, information is collected by various government departments, from the Panchayat level up to the state level. However, most of the data remains trapped within department walls and is largely used to produce performance reports. Also, issues with timely collection of data crop up, and sometimes the quality of the data collected is questionable, which delays the entire analysis.




Quality data plays an integral role if analyzed properly and at the proper time. It can be crucial for decision-making, delivery of services and important policy revisions. As a matter of fact, last year the Comptroller and Auditor General (CAG) set up the Centre for Data Management and Analytics (CDMA) to combine and incorporate relevant data for the purpose of auditing. The main purpose is to exploit the data available in government archives to build a stronger Indian audit and accounts department.

The Indian government is taking several steps to utilize the power of data. The Digital India and Smart Cities initiatives aim to employ data for designing, planning, managing, implementing and governing programs for a better, digital India. Many experts are of the opinion that government reforms would work best if they are properly synchronized with data to determine the impact of services, take better decisions, boost monitoring programmes and improve system performance.

An Open Data Policy is the need of the hour. Our government is working towards it, under the jurisdiction of the Department of Information and Technology (DIT), to boost the benefits of sharing information across departments and ministries. Harnessing data eases the load on team members while ensuring better accountability.

Tech startups and companies that probe into data and look for solutions in data hoarding and analytics to collect and manage complicated data streams need to be supported. The government, along with local players, should encourage citizens to help collect adequate information that could help them in the long run. India is moving towards a phase of rapid economic development, where commitment to information technology, data governance and open-source data is of prime importance. For the overall economy, bulk investments in capacity building, technology implementation and data-facilitating structures should be considered and implemented, so that plans and participation fall into place and bring about a better tech-inspired reality.

For data analyst certification in Delhi NCR, drop by DexLab Analytics – it’s a prime data science online training centre situated in the heart of Delhi.

The original article appeared on – https://economictimes.indiatimes.com/small-biz/security-tech/technology/indias-investment-in-big-data-will-ensure-governance/articleshow/57960046.cms


How Data Analytics Is Shaping and Developing Improved Storage Solutions

Technology has penetrated deep into our lives. The last 5 decades of the IT sector have been characterized by intense development in electronic storage solutions for recordkeeping.

Today, every file and every document is stored and archived safely and efficiently. Rows of data are tabulated in spreadsheets and stored in SQL relational databases for smooth access at any time by anyone who is authorized. Data is omnipresent. It is found in data warehouses, data lakes, data mines and data pools. Its volume is now so large that it may even be measured in units like the brontobyte.


Information is power. Data stored in archives is used to make accurate forecasts. Data evaluation began within a branch of mathematics, namely probability and statistical analysis.


Slowly, this discipline evolved into Business Intelligence and then further into Data Science. The latter is the most sought-after and well-paid career option for today’s tech-inspired generation. Grab a data science certification in Gurgaon and push your career to success.


Big Data Storage Challenges and Solutions

The responsibility of storing data, ensuring its security and providing access to it is huge. Managing volumes and volumes of data is a challenge in itself. For example, even powering and cooling enough HDD RAID arrays to hold an exabyte of raw data tends to break the bank for many companies.


Software-defined storage and flash devices are being deployed for big data storage, as they promise more direct business benefit. Also, Apache Hadoop and, increasingly, Apache Spark are taking care of the software side of big data analytics. Whether your big data cluster is built on these open-source architectures or on some other big data framework, it will surely impact your storage decisions.


Hadoop has been in the business of big data storage for quite some time now. It is a robust open-source framework for smooth processing of big data. It led to the emergence of server clusters, and Facebook has been known to run one of the largest Hadoop clusters, containing thousands of nodes.


Now, the question remains where and how you proceed with Hadoop. There are so many differing opinions about how to approach Hadoop clusters that at times it may leave you exasperated. We can help you here.


With a huge array of data at play, we suggest deploying dedicated processing, storage and networking systems in different racks to avoid latency or performance issues. For the same reasons, we ask you to stay away from running Hadoop in a virtual environment.

Instead, implement HDFS (Hadoop Distributed File System) – it is perfect for distributed storage and processing with the help of commodity hardware. The structure is simple, tolerant, expandable and scalable.


Besides, the cost of data storage should also be looked at. Costs should be kept low, and data compression features should ideally be implemented.

For Big Data Hadoop certification in Delhi NCR, drop by DexLab Analytics.


The Takeaway

Times are changing, and so are we. Big data analytics is becoming more real-time, so you had better scale up to real-time analytics. Today, data analytics has gone way beyond conventional desktop considerations. To keep pace with the evolution of analytics, you need a sound storage infrastructure in which upgrades to computing, storage and networking are readily available and easy to implement.


To find answers about big data or Hadoop, power yourself up with a good certification in Big Data Hadoop from DexLab Analytics. Such intensive big data courses do help!


How Precision Medicine is breaking off Chokehold on Healthcare with Big Data?

Big data is showering its miraculous effects on a range of industries, and the healthcare industry has not missed the bandwagon. Precision medicine is on the brink of a revolution in individualizing treatment, and healthcare professionals are devising ways to prevent and treat diseases with granularity down to a single patient’s genome. Nevertheless, many shudder at the thought of such humongous amounts of personal data, stored on servers, becoming vulnerable to attackers. What will happen then?

The global precision medicine market is expected to hit $88.64 billion. FYI, precision medicine is a specialized domain that draws on data about a patient’s genes, lifestyle and environment to build a clear picture of his/her health.

Busting the Security Challenges

Numerous efforts are under way to secure the storage facilities in which large chunks of genetic information are kept. Last year, Northrop Grumman Corp., a leading cyber-security company, published a white paper laying down clear guidelines on how to secure precision medicine data. The company seeks to aid the National Institute of Standards and Technology and the White House Precision Medicine Initiative.

In response, the AHA’s Institute for Precision Cardiovascular Medicine developed the Precision Medicine Platform to boost research into this kind of treatment. The platform is rich in functions, including high-end analytic tools that enable advanced computing and sharing of clinical trial data, hospital data, pharmaceutical data and personal data. Its security is very strong, and it passes all crucial compliance tests, according to Laura Stevens, AHA data scientist. “Even if you have data that you’d like to use, it’s sort of a walled garden behind your data so that it’s not accessible to people that don’t have access to the data, and it’s also HIPAA compliant. It meets the utmost secure standards of healthcare today,” she explained.


Boons of Data

The National Institutes of Health is creating a database of genetic information to help researchers cure and prevent cancer and other diseases. It aims to collect data from around 1 million Americans. For findings to apply to a larger, more diverse population, genetic information has to be collected from broader demographics.

The AHA’s Myresearchlegacy.org invites individuals to donate their health, genetic and lifestyle data to aid researchers in treating patients. At present, researchers are busy conducting precision medicine studies on treating diseases such as pancreatic, breast and other types of cancer. Not much of this progress would have been possible without advances in computing power and storage, coupled with big data and AI.

“The combination of benefits from process optimization, the ongoing transformation of medical data collection along the analog to digital continuum, and the availability of cheap memory and processing power and coding talent make the evolution of precision medicine inevitable,” David Sable, who runs the Special Situations Life Sciences Fund, wrote in Forbes. Apart from managing the fund, he teaches entrepreneurship in biotechnology at Columbia University.

Platform services, such as clusters running the Apache Spark big data framework on Amazon Elastic MapReduce (EMR), help speed up aggregation and analytics. Where AI and machine learning workloads tend to be time-consuming, EMR clusters help by scaling out and making the whole process faster to run, thereby answering research questions sooner and surfacing crucial healthcare insights.

For an in-depth understanding of Apache Spark, get certified through Apache Spark Training by DexLab Analytics, a prime Apache Spark training institute in India.



Write ETL Jobs to Offload the Data Warehouse Using Apache Spark

The surge of Big Data is everywhere. The evolving trends in BI have taken the world by storm, and a lot of organizations are now exploring how all this fits in.

Leverage the data ecosystem to its full potential and invest in the right technology pieces. It’s important to think ahead so as to reap maximum benefits from IT in the long run.

“By 2020, information will be used to reinvent, digitalize or eliminate 80% of business processes and products from a decade earlier.” – Gartner’s prediction put it so right!

The following architecture diagram presents a conceptual design. It helps you leverage the computing power of the Hadoop ecosystem for your conventional BI/data warehousing workloads, coupled with real-time analytics and data science (such data platforms are now often called data lakes).


In this post, we will discuss how to write ETL jobs that offload a data warehouse using the PySpark API of Apache Spark. With its lightning-fast data processing, Spark complements Hadoop.

As we are focusing on an ETL job in this blog, let’s introduce a parent table and a sub-dimension (type 2) table from a MySQL database, which we will merge into a single dimension table in Hive with dynamic partitions.

Avoid snowflaking when building a warehouse on Hive. This cuts out needless joins, since each join task generates a map task.

Just to pique your curiosity, the throughput of this example job on a standalone Spark deployment is 1M+ rows/min.

The two sources are an Employees table (300,024 rows) and a Salaries table (2,844,047 rows); employees’ salary records are kept in type 2 fashion using the ‘from_date’ and ‘to_date’ columns. The target is a partitioned Hive table, partitioned on the year of ‘to_date’ from the Salaries table and on the load date (the current date). Partitioning the table this way organizes the data better and speeds up queries on current employees, given that the ‘to_date’ column holds the end date ‘9999-01-01’ for all current records.
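To see why partitioning on the year of ‘to_date’ helps queries on current employees, here is a minimal plain-Python sketch (not Spark or Hive code). The sample rows are made up for illustration; only the ‘9999-01-01’ end-date convention comes from the description above.

```python
from collections import defaultdict
from datetime import date

# Made-up salary records in type 2 form: (emp_no, salary, from_date, to_date)
salaries = [
    (10001, 60117, date(1986, 6, 26), date(1987, 6, 26)),
    (10001, 62102, date(1987, 6, 26), date(9999, 1, 1)),   # current record
    (10002, 65828, date(1996, 8, 3), date(1997, 8, 3)),
    (10002, 65909, date(1997, 8, 3), date(9999, 1, 1)),    # current record
]

# Group rows by year(to_date), mimicking one directory per Hive partition
partitions = defaultdict(list)
for row in salaries:
    partitions[row[3].year].append(row)

# A query for current employees only has to read the year=9999 partition
current = partitions[9999]
print(sorted(partitions))        # partition keys
print([r[0] for r in current])   # emp_no values of the current records
```

Because every current record ends in year 9999, a query restricted to that partition can skip all historical rows instead of scanning the whole table.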

The logic is simple: join the two tables, add the load_date and year columns, then do a dynamic partition insert into a Hive table.
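The join-and-derive logic can be traced on a few rows in plain Python before running it on Spark. This is only an illustrative sketch: the sample records and the helper name build_dim_rows are hypothetical, and the real job works on Spark data frames rather than Python lists.

```python
from datetime import date

# emp_no -> (first_name, last_name); made-up sample rows
employees = {
    10001: ("Georgi", "Facello"),
    10002: ("Bezalel", "Simmel"),
}
# (emp_no, salary, from_date, to_date); made-up sample rows
salaries = [
    (10001, 60117, date(1986, 6, 26), date(1987, 6, 26)),
    (10001, 62102, date(1987, 6, 26), date(9999, 1, 1)),
    (10003, 40006, date(1995, 12, 3), date(1996, 12, 2)),  # no matching employee
]


def build_dim_rows(employees, salaries, load_date):
    """Inner-join salaries to employees on emp_no, then add year and load_date."""
    rows = []
    for emp_no, salary, from_date, to_date in salaries:
        if emp_no not in employees:   # an inner join drops unmatched rows
            continue
        first_name, last_name = employees[emp_no]
        rows.append({
            "emp_no": emp_no, "first_name": first_name, "last_name": last_name,
            "salary": salary, "from_date": from_date, "to_date": to_date,
            "year": to_date.year,     # first partition column
            "load_date": load_date,   # second partition column
        })
    return rows


dim_rows = build_dim_rows(employees, salaries, load_date=date.today())
for r in dim_rows:
    print(r["emp_no"], r["salary"], r["year"], r["load_date"])
```

Each output row carries its own year and load_date values, which is what lets a dynamic partition insert route rows to the right partition without naming partitions by hand.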

Check out how the DAG will look:


From version 1.4 onwards, the Spark UI visualizes the physical execution of a job as a Directed Acyclic Graph (the diagram above), similar to an ETL workflow. For this blog, we built Spark 1.5 with Hive and Hadoop 2.6.0.

Go through the code below to complete the job. It is commented throughout, and the runtime parameters are provided within the job itself, though ideally they would be parameterized.
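One way to parameterize runtime settings instead of hard-coding them is to read them from the command line. The sketch below uses Python's standard argparse module. The flag names (--mysql-host, --mysql-user and so on), the MYSQL_PASSWORD environment variable and the helper build_jdbc_url are hypothetical and are not part of the job that follows.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse job settings; defaults mirror the values hard-coded in the job."""
    parser = argparse.ArgumentParser(description="MySQL to Hive ETL job settings")
    parser.add_argument("--mysql-host", default="localhost")
    parser.add_argument("--mysql-port", type=int, default=3306)
    parser.add_argument("--mysql-db", default="employees")
    parser.add_argument("--mysql-user", required=True)
    parser.add_argument("--executor-memory", default="1g")
    return parser.parse_args(argv)


def build_jdbc_url(args, password):
    """Assemble the MySQL JDBC URL in the same shape the job uses."""
    return "jdbc:mysql://{0}:{1}/{2}?user={3}&password={4}".format(
        args.mysql_host, args.mysql_port, args.mysql_db, args.mysql_user, password)


if __name__ == "__main__":
    # Example with made-up values; a real run would call parse_args() with no arguments
    args = parse_args(["--mysql-user", "etl_user"])
    password = os.environ.get("MYSQL_PASSWORD", "")  # hypothetical variable name
    url = build_jdbc_url(args, password)
    print(url.split("&password=")[0] + "&password=********")  # avoid printing the secret
```

With spark-submit, anything placed after the script name is passed to the script, so such flags could be supplied as, for example, spark-submit mysql_to_hive_etl.py --mysql-user etl_user.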

Code: MySQL to Hive ETL Job

__author__ = 'udaysharma'
# File Name: mysql_to_hive_etl.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
from pyspark.sql import functions as F

# Define database connection parameters
MYSQL_DRIVER_PATH = "/usr/local/spark/python/lib/mysql-connector-java-5.1.36-bin.jar"
MYSQL_USERNAME = '********'  # placeholder: the user name is not shown in this listing
MYSQL_PASSWORD = '********'
MYSQL_CONNECTION_URL = "jdbc:mysql://localhost:3306/employees?user=" + MYSQL_USERNAME + "&password=" + MYSQL_PASSWORD

# Define Spark configuration
conf = SparkConf()
conf.set("spark.executor.memory", "1g")

# Initialize a SparkContext and SQLContext
sc = SparkContext(conf=conf)
sql_ctx = SQLContext(sc)

# Initialize hive context
hive_ctx = HiveContext(sc)

# Source 1 Type: MYSQL
# Schema Name  : EMPLOYEE
# Table Name   : EMPLOYEES
# + --------------------------------------- +
# | COLUMN     | TYPE         | KEY         |
# + --------------------------------------- +
# | EMP_NO     | INT          | PRIMARY KEY |
# | BIRTH_DATE | DATE         |             |
# | FIRST_NAME | VARCHAR(14)  |             |
# | LAST_NAME  | VARCHAR(16)  |             |
# | GENDER     | ENUM('M'/'F')|             |
# | HIRE_DATE  | DATE         |             |
# + --------------------------------------- +
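# Read the EMPLOYEES table over JDBC (options assumed from the parameters defined above)
df_employees = sql_ctx.load(
    source="jdbc",
    driver="com.mysql.jdbc.Driver",
    url=MYSQL_CONNECTION_URL,
    dbtable="employees")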

# Source 2 Type : MYSQL
# Schema Name   : EMPLOYEE
# Table Name    : SALARIES
# + -------------------------------- +
# | COLUMN      | TYPE | KEY         |
# + -------------------------------- +
# | EMP_NO      | INT  | PRIMARY KEY |
# | SALARY      | INT  |             |
# | FROM_DATE   | DATE |             |
# | TO_DATE     | DATE |             |
# + -------------------------------- +
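# Read the SALARIES table over JDBC (options assumed from the parameters defined above)
df_salaries = sql_ctx.load(
    source="jdbc",
    driver="com.mysql.jdbc.Driver",
    url=MYSQL_CONNECTION_URL,
    dbtable="salaries")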

# Perform INNER JOIN on  the two data frames on EMP_NO column
# As of Spark 1.4 you don't have to worry about duplicate column on join result
df_emp_sal_join = df_employees.join(df_salaries, "emp_no").select("emp_no", "birth_date", "first_name",
                                                             "last_name", "gender", "hire_date",
                                                             "salary", "from_date", "to_date")

# Adding a column 'year' to the data frame for partitioning the hive table
df_add_year = df_emp_sal_join.withColumn('year', F.year(df_emp_sal_join.to_date))

# Adding a load date column to the data frame
df_final = df_add_year.withColumn('Load_date', F.current_date())


# Registering data frame as a temp table for SparkSQL
hive_ctx.registerDataFrameAsTable(df_final, "EMP_TEMP")

# Target Type: APACHE HIVE
# Database   : EMPLOYEES
# Table Name : EMPLOYEE_DIM
# + ------------------------------- +
# | COLUMN     | TYPE   | PARTITION |
# + ------------------------------- +
# | EMP_NO     | INT    |           |
# | BIRTH_DATE | DATE   |           |
# | FIRST_NAME | STRING |           |
# | LAST_NAME  | STRING |           |
# | GENDER     | STRING |           |
# | HIRE_DATE  | DATE   |           |
# | SALARY     | INT    |           |
# | FROM_DATE  | DATE   |           |
# | TO_DATE    | DATE   |           |
# | YEAR       | INT    | PRIMARY   |
# | LOAD_DATE  | DATE   | SUB       |
# + ------------------------------- +
# Storage Format: ORC

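# Inserting data into the Target table
# (INSERT clause, partition spec and leading columns assumed from the target table layout above)
hive_ctx.sql("INSERT OVERWRITE TABLE EMPLOYEES.EMPLOYEE_DIM PARTITION (year, Load_date) \
            SELECT EMP_NO, BIRTH_DATE, FIRST_NAME, LAST_NAME, GENDER, HIRE_DATE, \
            SALARY, FROM_DATE, TO_DATE, year, Load_date FROM EMP_TEMP")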

As the necessary configuration is set within the code, we simply run the job with:

spark-submit mysql_to_hive_etl.py

Once the job has run, the target table will contain 2,844,047 rows, just as expected, and this is how the partitions will appear:





The best part is that the entire process is over within 2-3 minutes.
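As a quick sanity check, the row count and run time quoted in this post can be turned into a throughput figure. The helper name rows_per_minute is just for illustration.

```python
def rows_per_minute(rows, minutes):
    """Average throughput for a job that loads `rows` rows in `minutes` minutes."""
    return rows / float(minutes)


rows = 2844047  # rows loaded into the target table, as stated above

for minutes in (2, 3):
    print("%d min -> %.2f million rows/min" % (minutes, rows_per_minute(rows, minutes) / 1e6))
```

This gives roughly 0.95 to 1.42 million rows per minute, in line with the 1M+ rows/min figure mentioned earlier for the faster runs.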

For more such interesting blogs and updates, follow us at DexLab Analytics. We are a premium Big Data Hadoop institute in Gurgaon catering to the needs of aspiring candidates. Opt for our comprehensive Hadoop certification in Delhi and crack such codes in a jiffy!

