Big Data Hadoop Course in Pune | Big Data Hadoop Course in Mumbai

Big Data - Hadoop & Spark (AWS) (Python Track)

Mode: Online / Offline (Classroom) / Hybrid
*Prerequisite: Knowledge of Python, Machine Learning, SQL

Research suggests that by the end of 2021 India alone will face a shortage of about two lakh data scientists. Big Data is likely to grow in India because businesses are increasingly aware of the benefits that insights from unstructured data can bring.
Jobs for Hadoop developers are on the rise as organizations from verticals such as e-commerce, retail, automobile and telecom adopt analytics to gain an advantage over their competitors.
Data volumes will continue to increase, and with such an exponential increase in the use of data analytics, the global Big Data market is expected to grow at a CAGR of 18.68% during the forecast period and reach revenue of $183.62 billion by 2027.
The average salary for Big Data Hadoop Analysts ranges from $68,465 to $138,808.

Gamaka AI offers the best big data course in Pune with 100% placement assistance

Program Structure

  • Big Data introduction
    • What is Big Data?
    • V’s of Big Data (Volume, Velocity, Variety, Veracity)
    • Data types
    • Distributed system
    • Single system vs. distributed system
    • Solution for Big Data: Hadoop
  • Hadoop core components
    • Differences between Hadoop v1 and v2
    • Overview of the Hadoop ecosystem
    • MapReduce
  • Introduction to AWS & Cloud
    • Cloud computing
    • AWS basics
    • AWS services
    • Setting up an AWS Free Tier account
    • Big data computation on AWS
    • Access Permissions with S3
    • SQL vs. NoSQL Databases
    • Databases and Big Data on AWS
    • Working on EMR with Hive
  • Spark overview
    • Spark Architecture
    • RDD
    • MLlib
    • Linear regression on Spark
    • Logistic regression on Spark
    • Decision tree on Spark
    • Naive Bayes on Spark
    • XGBoost on Spark
  • AWS ML tools
    • Amazon SageMaker

Duration: 2 Months / 30+ hours

Hybrid – Combination of Online & Offline (classroom)

Projects/Case Studies

Any 2 Case Studies (T & C apply)

Industry: General

Problem Statement: How to successfully import data using Sqoop into HDFS for data analysis

Topics: As part of this project, you will work on various Hadoop components such as MapReduce, Apache Hive and Apache Sqoop. You will use Sqoop to import data from a relational database management system such as MySQL into HDFS, and deploy Hive to summarize, query and analyze the data. You will convert SQL queries to HiveQL so that they run as MapReduce jobs on the transferred data. You will gain considerable proficiency in Hive and Sqoop by the end of this project.

  • Sqoop data transfer from RDBMS to Hadoop
  • Coding in Hive Query Language
  • Data querying and analysis

Industry: Media and Entertainment

Problem Statement: How to create the top-ten-movies list using the MovieLens data

Topics: In this project you will work exclusively on the publicly available MovieLens rating data sets. The project involves writing a MapReduce program to analyze the MovieLens data and create a list of the top ten movies. You will also use Apache Pig and Apache Hive to work with and analyze distributed datasets.

  • MapReduce program for working on the data file
  • Apache Pig for analyzing data
  • Apache Hive data warehousing and querying
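
The MapReduce step of this project can be sketched in plain Python. This is only an illustration of the map, shuffle and reduce phases on a handful of made-up ratings, not a Hadoop job. The sample rows, the `min_ratings` threshold and all function names are assumptions made for the sketch; the tab-separated user, movie, rating, timestamp layout follows the MovieLens 100K `u.data` file.

```python
import heapq
from collections import defaultdict

# Made-up sample rows: user_id, movie_id, rating, timestamp (tab-separated).
SAMPLE_RATINGS = """\
1\t10\t5\t881250949
2\t10\t4\t881250950
3\t10\t5\t881250951
1\t20\t3\t881250952
2\t20\t2\t881250953
3\t30\t5\t881250954
"""


def map_rating(line):
    """Map phase: emit (movie_id, rating) for one input line."""
    _user_id, movie_id, rating, _timestamp = line.split("\t")
    return movie_id, int(rating)


def shuffle(pairs):
    """Shuffle phase: group all ratings by movie_id."""
    grouped = defaultdict(list)
    for movie_id, rating in pairs:
        grouped[movie_id].append(rating)
    return grouped


def reduce_ratings(movie_id, ratings):
    """Reduce phase: average rating and rating count for one movie."""
    return movie_id, sum(ratings) / len(ratings), len(ratings)


def top_movies(lines, n=10, min_ratings=2):
    """Return up to n (movie_id, average, count) rows, best average first."""
    pairs = (map_rating(line) for line in lines if line.strip())
    reduced = (reduce_ratings(m, r) for m, r in shuffle(pairs).items())
    # Ignore movies with too few ratings to be meaningful (hypothetical rule).
    eligible = [row for row in reduced if row[2] >= min_ratings]
    return heapq.nlargest(n, eligible, key=lambda row: row[1])


top = top_movies(SAMPLE_RATINGS.splitlines())
for movie_id, average, count in top:
    print(movie_id, round(average, 2), count)
```

On a real cluster the grouping is done by the framework between the map and reduce tasks; here `shuffle` stands in for that step.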


Problem Statement: How to bring the daily data (incremental data) into the Hadoop Distributed File System

Topics: In this project, transaction data is recorded daily in an RDBMS and transferred every day into HDFS for further Big Data analytics. You will work on a live Hadoop YARN cluster. YARN is the part of the Hadoop ecosystem that decouples Hadoop from MapReduce, allowing it to run other processing engines and a wider array of applications. You will work with the YARN central resource manager.

  • Using Sqoop commands to bring the data into HDFS
  • End-to-end flow of transaction data
  • Working with the data from HDFS
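
The idea behind an incremental import can be sketched as follows. Sqoop's append mode (`--incremental append` with `--check-column` and `--last-value`) picks up only rows whose check column is greater than the last imported value. The sketch below mimics that logic in Python under clearly stated assumptions: an in-memory SQLite database stands in for the RDBMS, a Python list stands in for the HDFS target, and the `transactions` table and function names are hypothetical.

```python
import sqlite3


def incremental_import(conn, last_value):
    """Fetch rows with id > last_value and return them with the new high-water mark."""
    rows = conn.execute(
        "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id",
        (last_value,),
    ).fetchall()
    new_last_value = rows[-1][0] if rows else last_value
    return rows, new_last_value


# In-memory SQLite stands in for the source RDBMS (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", [(1, 250.0), (2, 99.5)])

hdfs_stand_in = []  # a list stands in for the HDFS target directory
last_value = 0      # nothing imported yet

# Day 1: both existing rows are new.
rows, last_value = incremental_import(conn, last_value)
hdfs_stand_in.extend(rows)

# Day 2: one more transaction arrives; only that row is imported.
conn.execute("INSERT INTO transactions VALUES (3, 12.0)")
rows, last_value = incremental_import(conn, last_value)
hdfs_stand_in.extend(rows)

print(rows, last_value)
```

Storing the high-water mark between runs is what keeps the daily job from re-importing rows that are already in HDFS.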

Industry: Banking

Problem Statement: How to improve the query speed using Hive data partitioning

Topics: This project involves working with Hive table data partitioning. The right partitioning helps you read the data, deploy it on HDFS and run MapReduce jobs much faster. Hive lets you partition data in multiple ways. You will get hands-on experience in partitioning Hive tables manually, in dynamic partitioning driven by a single SQL statement, and in bucketing data to break it into manageable chunks.

  • Manual Partitioning
  • Dynamic Partitioning
  • Bucketing
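
Why partitioning speeds up queries can be shown with a small Python model. Hive stores each partition in its own `column=value` directory and assigns rows to buckets by hashing the bucketing column modulo the bucket count. The sketch below imitates both ideas in memory; the table, the column names and the use of CRC32 are assumptions for illustration (Hive uses its own hash function).

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count

ROWS = [
    {"txn_date": "2020-01-01", "account_id": "A1", "amount": 100},
    {"txn_date": "2020-01-01", "account_id": "B2", "amount": 40},
    {"txn_date": "2020-01-02", "account_id": "A1", "amount": 75},
]


def partition_path(table, row):
    """Hive-style partition directory: one folder per partition column value."""
    return f"{table}/txn_date={row['txn_date']}"


def bucket_for(account_id, num_buckets=NUM_BUCKETS):
    """Bucket number = hash(bucketing column) mod bucket count (CRC32 as a stand-in)."""
    return zlib.crc32(account_id.encode("utf-8")) % num_buckets


def layout(table, rows):
    """Place every row in its (partition directory, bucket) slot."""
    files = defaultdict(list)
    for row in rows:
        key = (partition_path(table, row), bucket_for(row["account_id"]))
        files[key].append(row)
    return files


def query_by_date(files, table, txn_date):
    """Partition pruning: read only the directory that matches the filter."""
    wanted = f"{table}/txn_date={txn_date}"
    return [
        row
        for (path, _bucket), rows in files.items()
        if path == wanted
        for row in rows
    ]


files = layout("transactions", ROWS)
jan1 = query_by_date(files, "transactions", "2020-01-01")
print(sorted(row["amount"] for row in jan1))
```

A query filtered on `txn_date` touches one directory instead of the whole table, which is where the speed-up comes from.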

Industry: Social Network

Problem Statement: How to deploy ETL for data analysis activities

Topics: This project lets you connect Pentaho with the Hadoop ecosystem. Pentaho works well with HDFS, HBase, Oozie and ZooKeeper. You will connect the Hadoop cluster with Pentaho data integration, analytics, Pentaho server and report designer. This project will give you complete working knowledge on the Pentaho ETL tool.

  • Working knowledge of ETL and Business Intelligence
  • Configuring Pentaho to work with Hadoop distribution
  • Loading, transforming and extracting data into Hadoop cluster

Industry: General

Problem Statement: How to set up a real-time Hadoop cluster on Amazon EC2

Topics: This project gives you the opportunity to work on a real-world Hadoop multi-node cluster setup in a distributed environment. You will get a complete demonstration of working with Hadoop cluster master and slave nodes, installing Java as a prerequisite for running Hadoop, installing Hadoop and mapping the nodes in the Hadoop cluster.

  • Hadoop installation and configuration
  • Running multi-node Hadoop on a 4-node cluster on Amazon EC2
  • Deploying a MapReduce job on the Hadoop cluster

Industry: General

Problem Statement: How to test MapReduce applications

Topics: In this project, you will gain proficiency in Hadoop MapReduce code testing using MRUnit. You will learn about real-world scenarios of deploying MRUnit, Mockito and PowerMock. This will give you hands-on experience in various testing tools for Hadoop MapReduce. After completion of this project you will be well-versed in test-driven development and will be able to write lightweight unit tests that work specifically on the Hadoop architecture.

  • Writing JUnit tests using MRUnit for MapReduce applications
  • Mocking static methods using PowerMock and Mockito
  • MapReduce Driver for testing the map and reduce pair
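
MRUnit, Mockito and PowerMock are Java tools, so the sketch below is only a Python analogue of the same testing pattern, using the standard `unittest` and `unittest.mock` modules. It tests a mapper alone, a reducer alone, and the map and reduce pair through a tiny driver. The word-count functions and the `run_map_reduce` driver are hypothetical names introduced for this illustration.

```python
import unittest
from unittest import mock


def word_mapper(line):
    """Map: emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    """Reduce: total the counts for one key."""
    return key, sum(values)


def run_map_reduce(lines, mapper, reducer):
    """Tiny driver: map every line, group by key, reduce in key order."""
    grouped = {}
    for line in lines:
        for key, value in mapper(line):
            grouped.setdefault(key, []).append(value)
    return [reducer(key, grouped[key]) for key in sorted(grouped)]


class MapReduceTest(unittest.TestCase):
    def test_mapper(self):
        self.assertEqual(word_mapper("Hadoop hadoop"), [("hadoop", 1), ("hadoop", 1)])

    def test_reducer(self):
        self.assertEqual(sum_reducer("hadoop", [1, 1, 1]), ("hadoop", 3))

    def test_map_reduce_pair(self):
        result = run_map_reduce(["a b", "b"], word_mapper, sum_reducer)
        self.assertEqual(result, [("a", 1), ("b", 2)])

    def test_driver_calls_mapper_once_per_line(self):
        # A mock replaces the real mapper so only the driver is under test.
        fake_mapper = mock.Mock(return_value=[("k", 1)])
        run_map_reduce(["x", "y"], fake_mapper, sum_reducer)
        self.assertEqual(fake_mapper.call_count, 2)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(MapReduceTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the mapper and reducer as plain functions is what makes them testable without a running cluster.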

Industry: Internet Services

Problem Statement: How to derive insights from web log data

Topics: This project is about making sense of web log data in order to derive valuable insights from it. You will load the server data onto a Hadoop cluster using various techniques. Web log data can include the URLs visited, cookie data, user demographics, location, and the date and time of web service access. In this project, you will transport the data using Apache Flume or Kafka, and handle workflow and data cleansing using MapReduce, Pig or Spark. The insights derived can be used to analyze customer behavior and predict buying patterns.

  • Aggregation of log data
  • Apache Flume for data transportation
  • Processing of data and generating analytics
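
The cleansing and aggregation steps can be sketched in plain Python before moving them to MapReduce, Pig or Spark. The example assumes the logs are in the Apache Common Log Format; the sample lines, the regular expression and the function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern for the Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2020:13:55:36 +0530] "GET /products/1 HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2020:13:55:40 +0530] "GET /products/1 HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2020:13:56:01 +0530] "GET /cart HTTP/1.1" 404 512',
    'not a log line',
]


def parse_line(line):
    """Return the named fields of one log line, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def aggregate(lines):
    """Count hits per URL and responses per status code, skipping bad lines."""
    hits_per_url = Counter()
    status_counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:  # data cleansing: drop lines that do not parse
            continue
        hits_per_url[record["url"]] += 1
        status_counts[record["status"]] += 1
    return hits_per_url, status_counts


hits_per_url, status_counts = aggregate(SAMPLE_LOG)
print(hits_per_url.most_common(1))
```

The same parse, filter and count structure maps directly onto a mapper that emits `(url, 1)` pairs and a reducer that sums them.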

Industry: General

Problem Statement: How to administer a Hadoop cluster

Topics: This project involves maintaining and managing a Hadoop cluster. You will work on a number of important tasks, including recovering data, recovering from failure, adding and removing machines from the Hadoop cluster and onboarding users on Hadoop.

  • Working with name node directory structure
  • Audit logging, data node block scanner and balancer
  • Failover, fencing, DistCp and Hadoop file formats

Industry: Social Media

Problem Statement: Find out how people reacted to India's demonetization move by analyzing their tweets

Topics: This project involves analyzing tweets to find out what people are saying about the demonetization decision taken by the Indian government. You then look for key phrases and words and score them using a dictionary that assigns each word a value based on the sentiment it conveys.

  • Download the tweets and load into Pig storage
  • Divide tweets into words to calculate sentiment
  • Rating the words from +5 to −5 using the AFINN dictionary
  • Filtering the tweets and analyzing sentiment
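
The steps above can be sketched in a few lines of Python before writing them as Pig scripts. The word list below is a tiny stand-in for the AFINN dictionary: its words and scores are illustrative assumptions, not values copied from the real list, and the sample tweets and function names are made up.

```python
import re

# Stand-in for the AFINN word list (scores range from -5 to +5).
# These entries are assumptions for the sketch, not the real dictionary.
WORD_SCORES = {
    "good": 3, "great": 3, "support": 2,
    "bad": -3, "chaos": -2, "disaster": -4,
}

TWEETS = [
    "Demonetization is a great move, good for the economy",
    "Demonetization brought chaos at banks, a disaster",
    "Demonetization was announced in November 2016",
    "Great weather today",
]


def mentions(tweet, keyword="demonetization"):
    """Filter step: keep only tweets about the topic."""
    return keyword in tweet.lower()


def tokenize(tweet):
    """Split a tweet into lowercase words."""
    return re.findall(r"[a-z']+", tweet.lower())


def tweet_score(tweet, scores=WORD_SCORES):
    """Average the scores of the rated words; 0.0 if no word is rated."""
    rated = [scores[word] for word in tokenize(tweet) if word in scores]
    return sum(rated) / len(rated) if rated else 0.0


def classify(tweet):
    """Label a tweet by the sign of its average score."""
    score = tweet_score(tweet)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [classify(tweet) for tweet in TWEETS if mentions(tweet)]
print(labels)
```

Averaging over rated words only is one possible design choice; summing the scores or averaging over all words are equally simple alternatives.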

Industry: Sports and Entertainment

Problem Statement: Analyze the entire cricket match and get answers to any question regarding the details of the match

Topics: This project involves working with the IPL dataset that has information regarding batting, bowling, runs scored, wickets taken and more. This dataset is taken as input, and then it is processed so that the entire match can be analyzed based on the user queries or needs.

  • Load the data into HDFS
  • Analyze the data using Apache Pig or Hive
  • Giving the right output based on user queries
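
The kind of query this project answers can be prototyped in Python on a few rows before running it in Pig or Hive. The CSV layout, the column names and the player names below are hypothetical and do not reflect the actual IPL dataset schema.

```python
import csv
import io
from collections import defaultdict

# Made-up ball-by-ball sample; the column names are assumptions for this sketch.
SAMPLE = """match_id,batsman,bowler,runs,is_wicket
1,Batsman A,Bowler X,4,0
1,Batsman A,Bowler X,0,1
1,Batsman B,Bowler X,6,0
1,Batsman B,Bowler Y,1,0
1,Batsman B,Bowler Y,0,1
"""


def load(text):
    """Parse CSV text into a list of dictionaries, one per delivery."""
    return list(csv.DictReader(io.StringIO(text)))


def runs_by_batsman(rows):
    """Comparable to: SELECT batsman, SUM(runs) ... GROUP BY batsman."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["batsman"]] += int(row["runs"])
    return dict(totals)


def wickets_by_bowler(rows):
    """Comparable to: SELECT bowler, SUM(is_wicket) ... GROUP BY bowler."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["bowler"]] += int(row["is_wicket"])
    return dict(totals)


deliveries = load(SAMPLE)
print(runs_by_batsman(deliveries))
print(wickets_by_bowler(deliveries))
```

Each user question becomes one grouping and aggregation over the deliveries, which is the pattern a `GROUP BY` query expresses in Hive.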

Advantages of joining Gamaka AI

  • Instructor-led online & classroom interactive sessions
  • One-To-One online problem-solving sessions
  • Complete Soft Copy of Notes & Latest Interview Preparation Set
  • Trainers are working IT professionals with top IT MNCs
  • 100% Placement Assistance
  • Resume Building & Mock Interview Sessions
  • 100% Hands-on Training with Live Projects/Case Studies
  • Internship & Course Completion Certificate
  • 1-year free subscription to the portal for updated guides, notes, POCs, projects & interview preparation set
  • Extensive training programs with Recorded Sessions
  • 24*7 Support on enquiry@gamakaai.com


Fee Structure


Fees: ₹ 30,000/- (50% OFF): ₹ 15,500/-
  • 2 installments: ₹ 7,500/- (10 days gap)
  • Down Payment: ₹ 14,000/-

Fees: ₹ 65,000/- (50% OFF): ₹ 32,500/-
  • 2 installments: ₹ 16,000/- (10 days gap)
  • Down Payment: ₹ 31,000/-

Fees: ₹ 80,000/- (50% OFF): ₹ 39,500/-
  • 2 installments (10 days gap)
  • Down Payment: ₹ 38,000/-

  • Registration – ₹ 500/-
  • Weekdays & Weekends Batches – Flexible Timings

Will I get certified?

Upon successful completion of this course, you’ll earn a certificate that adds weight to your portfolio.


Internship Certificate

This certificate will be issued to those pursuing internships with our development team or with clients with whom we have tie-ups. A Data Science internship gives you the opportunity to learn from professionals, gain practical experience in the field and build a robust professional network.

Copyright © 2020 Gamaka AI | All Rights Reserved