
M20775 Performing Data Engineering on Microsoft HD Insight

The main purpose of the course is to give students the ability to plan and implement big data workflows on HDInsight.

Audience profile

The primary audience for this course is data engineers, data architects, data scientists, and data developers who plan to implement big data engineering workflows on HDInsight.


Before attending this course, students must have:

  • Programming experience using R, and familiarity with common R packages
  • Knowledge of common statistical methods and data analysis best practices
  • Basic knowledge of the Microsoft Windows operating system and its core functionality
  • Working knowledge of relational databases
After completing this course, students will be able to:
  • Deploy HDInsight Clusters
  • Authorize Users to Access Resources
  • Load Data into HDInsight
  • Troubleshoot HDInsight
  • Implement Batch Solutions
  • Design Batch ETL Solutions for Big Data with Spark
  • Analyze Data with Spark SQL
  • Analyze Data with Hive and Phoenix
  • Describe Stream Analytics
  • Implement Spark Streaming Using the DStream API
  • Develop Big Data Real-Time Processing Solutions with Apache Storm
  • Build Solutions that use Kafka and HBase

Course Outline

Module 1: Getting Started with HDInsight
  • What is Big Data?
  • Introduction to Hadoop
  • Working with MapReduce Function
  • Introducing HDInsight
  • Provision an HDInsight cluster and run MapReduce jobs
Module 2: Deploying HDInsight Clusters
  • Identifying HDInsight cluster types
  • Managing HDInsight clusters by using the Azure portal
  • Managing HDInsight Clusters by using Azure PowerShell
  • Create an HDInsight cluster that uses Data Lake Store storage
  • Customize HDInsight by using script actions
  • Delete an HDInsight cluster
Module 3: Authorizing Users to Access Resources
  • Non-domain Joined clusters
  • Configuring domain-joined HDInsight clusters
  • Manage domain-joined HDInsight clusters
  • Prepare the Lab Environment
  • Manage a non-domain joined cluster
Module 4: Loading data into HDInsight
  • Storing data for HDInsight processing
  • Using data loading tools
  • Maximising value from stored data
  • Load data for use with HDInsight
Module 5: Troubleshooting HDInsight
  • Analyze HDInsight logs
  • YARN logs
  • Heap dumps
  • Operations Management Suite
  • Analyze YARN logs
  • Monitor resources with Operations Management Suite
Module 6: Implementing Batch Solutions
  • Apache Hive storage
  • HDInsight data queries using Hive and Pig
  • Operationalize HDInsight
  • Deploy HDInsight cluster and data storage
  • Use data transfers with HDInsight clusters
  • Query HDInsight cluster data
Module 7: Design Batch ETL solutions for big data with Spark
  • What is Spark?
  • ETL with Spark
  • Spark performance
  • Create an HDInsight Cluster with access to Data Lake Store
  • Use HDInsight Spark cluster to analyze data in Data Lake Store
  • Analyzing website logs using a custom library with Apache Spark cluster on HDInsight
  • Managing resources for Apache Spark cluster on Azure HDInsight
Module 8: Analyze Data with Spark SQL
  • Implementing iterative and interactive queries
  • Perform exploratory data analysis
  • Build a machine learning application
  • Use Zeppelin for interactive data analysis
  • View and manage Spark sessions by using Livy
Module 9: Analyze Data with Hive and Phoenix
  • Implement interactive queries for big data with Interactive Hive
  • Perform exploratory data analysis by using Hive
  • Perform interactive processing by using Apache Phoenix
Module 10: Stream Analytics
  • Stream Analytics
  • Process streaming data from Stream Analytics
  • Managing Stream Analytics jobs
  • Process streaming data with Stream Analytics
Module 11: Implementing Streaming Solutions with Kafka and HBase
  • Building and Deploying a Kafka Cluster
  • Publishing, Consuming, and Processing data using the Kafka Cluster
  • Using HBase to store and Query Data
  • Create a virtual network and gateway
  • Create a Storm cluster for Kafka
  • Create a Kafka producer
  • Create a streaming processor client topology
  • Create a Power BI dashboard and streaming dataset
  • Create an HBase cluster
  • Create a streaming processor to write to HBase
Module 12: Develop big data real-time processing solutions with Apache Storm
  • Persist long-term data
  • Stream data with Storm
  • Create Storm topologies
  • Configure Apache Storm
Module 13: Create Spark Streaming Applications
  • Working with Spark Streaming
  • Creating Spark Structured Streaming Applications
  • Persistence and Visualization
  • Installing Required Software
  • Building the Azure Infrastructure
  • Building a Spark Streaming Pipeline
Course length
5 days (40 hours)



