【Hadoop】Learning Notes

Posted by 西维蜀黍 on 2023-09-22, Last Modified on 2023-09-22

Hadoop

Hadoop allows the distributed processing of large data sets stored across clusters of computers.

The Hadoop framework consists of two main components:

  • Hadoop Distributed File System (HDFS)
    • HDFS is an open-source variant of the Google File System (GFS)
  • MapReduce programming framework
    • Hadoop MapReduce is the open-source variant of Google MapReduce (a minimal word-count sketch follows below)
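
To make the two components concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API: the mapper emits (word, 1) pairs for text read from HDFS, and the reducer sums them per word. The input and output HDFS paths are placeholders passed as command-line arguments.

```java
// Canonical word-count sketch on the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```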

ZooKeeper

ZooKeeper is used to coordinate the cluster in the Hadoop framework.

Several Hadoop projects already use ZooKeeper to coordinate their clusters and provide highly available distributed services.
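
As a rough sketch of this coordination role, the snippet below registers a process as an ephemeral znode so that the children of a shared path always reflect the workers that are currently alive. The ensemble address localhost:2181 and the /workers namespace are placeholder assumptions for illustration.

```java
// Membership sketch with the ZooKeeper Java client: each worker registers an
// ephemeral znode that disappears automatically when its session ends.
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkerRegistration {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to the ZooKeeper ensemble (placeholder address).
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, (WatchedEvent event) -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Make sure the parent namespace exists (persistent znode).
    if (zk.exists("/workers", false) == null) {
      zk.create("/workers", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Ephemeral, sequential znode: removed automatically if this worker dies,
    // so the list under /workers always reflects live processes.
    String path = zk.create("/workers/worker-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    System.out.println("Registered as " + path);

    // Any peer can list the current members.
    List<String> members = zk.getChildren("/workers", false);
    System.out.println("Live workers: " + members);

    Thread.sleep(Long.MAX_VALUE); // keep the session (and the znode) alive
  }
}
```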

Apache Avro

Apache Avro is used for data serialization and Remote Procedure Calls (RPC) in the Hadoop ecosystem. It supports self-describing data: every record is described by a schema, and that schema is stored in the same file as the data it describes.
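
A minimal sketch of that self-describing property using Avro's Java API: DataFileWriter embeds the schema in the file header, so DataFileReader can decode the records without being handed a schema. The User schema and the users.avro path are made up for illustration.

```java
// Write and read an Avro container file; the reader recovers the schema
// from the file itself rather than from an external definition.
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSelfDescribing {
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    File file = new File("users.avro"); // placeholder path

    // Write: DataFileWriter stores the schema in the file header.
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Read: no schema is supplied; it is recovered from the file itself.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      System.out.println("Schema from file: " + reader.getSchema());
      for (GenericRecord record : reader) {
        System.out.println(record);
      }
    }
  }
}
```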

Other projects in the Hadoop ecosystem include:

  • Pig
    • Pig provides an engine for executing data flows in parallel on Hadoop. It uses MapReduce to execute all of its data processing.
  • Hive
    • Hive is a data warehouse infrastructure originally developed by Facebook. It is used for data summarization, querying, and analysis.
    • Queries are written in HiveQL, a SQL-like language.
  • HBase
    • HBase is a non-relational, distributed database built on Hadoop and inspired by Google BigTable. It is often used as a storage system for MapReduce job outputs.
    • It is most useful for storing very large, column-oriented tables that require random, real-time read/write access.
  • Sqoop
    • Sqoop is a tool for bulk data transfer between HDFS and structured data stores such as relational databases (RDBMS).
  • Spark
    • Spark is a framework for writing fast, distributed programs. It solves similar problems to Hadoop MapReduce, but with a fast in-memory approach and a clean, functional-style API (see the sketch after this list).
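
For contrast with the MapReduce word count shown earlier, the sketch below expresses the same computation in Spark's Java RDD API. The hdfs:///input and hdfs:///output paths are placeholder assumptions, and the job would typically be packaged and launched with spark-submit.

```java
// Word count with Spark's Java RDD API, as a counterpart to the MapReduce job above.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark word count");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///input");              // placeholder HDFS path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split lines into words
          .mapToPair(word -> new Tuple2<>(word, 1))                      // emit (word, 1)
          .reduceByKey(Integer::sum);                                    // sum counts per word
      counts.saveAsTextFile("hdfs:///output");                           // placeholder HDFS path
    }
  }
}
```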
