Speaking of Big Data, the first name that comes to many people's minds is Hadoop. But what exactly is Hadoop?
Hadoop, formally known as Apache Hadoop, is a software framework for managing large data sets and distributed processing on a computer cluster.
Personally, I view Hadoop as a computing and data storage platform because it has two main functions:
- Storage management
- Computing resource management
And I say that Hadoop is a platform because it lets us build programs that run on Hadoop, using the resources Hadoop provides.
Hadoop is made up of four key components:
- Common – A set of common libraries and utilities that support the functionality of the other Hadoop modules.
- Hadoop Distributed File System (HDFS) – A distributed file system that allows users to manage large files easily.
- YARN (Yet Another Resource Negotiator) – A framework for job scheduling and resource management on the Hadoop cluster.
- MapReduce – A programming model (or programming paradigm) for writing parallel-processing programs. Simply put, it lets us write programs that run on many computers in our Hadoop cluster at the same time.
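To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python. It is not real Hadoop code (real jobs use the Hadoop Java API or Hadoop Streaming); it only imitates the map, shuffle, and reduce phases on a single machine.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # -> {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In a real cluster, the map calls run on many machines at once, the framework performs the shuffle over the network, and the reduce calls run in parallel as well; the program structure, however, stays this simple.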
The two components that made Hadoop famous are HDFS and YARN.
HDFS (Hadoop Distributed File System)
HDFS is a file system that manages data files within the Hadoop cluster by splitting each file into pieces (blocks) and distributing the blocks across different machines in the cluster. Each block is also replicated on several machines, which guarantees the data survives if a node in the cluster fails.
To the user, HDFS appears as a single drive: files can be accessed from one location even though they are actually distributed across multiple machines, and users do not need to know how many machines the cluster contains. When we store a lot of files and run low on space, we can add new machines to the Hadoop cluster (horizontal scaling) to gain storage capacity without shutting the system down.
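As a rough illustration of the block idea (the real HDFS NameNode logic is far more elaborate; the 128 MB block size and replication factor of 3 used here are merely HDFS's common defaults), the sketch below splits a file into fixed-size blocks and assigns each block to several distinct nodes:

```python
def place_blocks(file_size, nodes, block_size=128, replication=3):
    """Split a file of `file_size` MB into blocks and assign each block
    to `replication` distinct nodes, round-robin across the cluster."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

cluster = ["node1", "node2", "node3", "node4"]
layout = place_blocks(file_size=300, nodes=cluster)
for block, replicas in layout.items():
    print(f"block {block} -> {replicas}")
```

A 300 MB file becomes three blocks, each stored on three different machines, so losing any single node never loses data; adding a node to `cluster` immediately gives the placement logic more capacity to spread blocks over.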
YARN (Yet Another Resource Negotiator) and MapReduce
Hadoop YARN is the resource (memory and CPU) manager within the Hadoop cluster. It acts like a housekeeper, scheduling tasks so that they run smoothly on the Hadoop system, especially MapReduce jobs.
Apache Hadoop YARN
All programs written in the MapReduce model run entirely on YARN. Programmers no longer have to write tedious code to manage system resources themselves, as they did before, because YARN manages the resources for them.
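A toy sketch of the negotiation idea (this is not YARN's real API; the node sizes and job names are invented for illustration): applications request containers with a memory size, and the scheduler grants each request on a node that still has enough free memory, leaving the rest pending.

```python
def allocate(requests, nodes):
    """Grant each container request (app, mem_gb) on the first node with
    enough free memory; return the grants and the requests that must wait."""
    free = dict(nodes)          # node -> free memory in GB
    grants, pending = [], []
    for app, mem in requests:
        for node, avail in free.items():
            if avail >= mem:
                free[node] -= mem
                grants.append((app, node, mem))
                break
        else:
            pending.append((app, mem))
    return grants, pending

nodes = {"node1": 8, "node2": 4}
requests = [("job-a", 6), ("job-b", 4), ("job-c", 4)]
grants, pending = allocate(requests, nodes)
print(grants)   # job-a fits on node1, job-b on node2
print(pending)  # job-c must wait until memory is released
```

The real ResourceManager also tracks CPU, queues, priorities, and data locality, but the core service is the same: applications ask for resources, YARN decides where and when they run.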
The Hadoop Ecosystem
Several software projects have evolved around Hadoop into its ecosystem, each building on Hadoop's capabilities to deliver extra functionality, as shown in the examples below.
- ZooKeeper – A coordination service for distributed applications on the Hadoop cluster, acting as a housekeeper for Hadoop programs and services.
- Storm – A processing engine for streaming, or real-time, data.
- Solr/Lucene – A full-text search engine that can index and search data stored on HDFS.
- Mahout – A collection of tools for machine learning tasks and mathematical models.
- Hive – An SQL engine that queries data with SQL-like commands and runs on top of MapReduce; convenient for anyone already familiar with querying data in SQL.
- Pig – A platform for data processing with an easy-to-understand, high-level scripting language. Pig scripts execute on the MapReduce framework and can even run in Spark mode.
- Spark – A cluster computing framework used to build large-scale data processing applications. It serves a similar role to MapReduce, but Spark works in memory, whereas MapReduce writes intermediate results to disk.
- HBase – A non-relational (non-RDBMS) database system that stores its data on HDFS.
- Cassandra – A NoSQL database system that handles big data and offers high scalability.
- Flume – A tool for handling log data: collecting, aggregating, and transferring large volumes of streaming log data.
- Kafka – A real-time messaging system for publishing and subscribing to data streams, capable of handling large streaming data.
- Sqoop – A data transfer tool between RDBMSs and Hadoop. It can import data from a traditional database into Hadoop, or export data from Hadoop back to a database.
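To illustrate the Spark-versus-MapReduce point from the list above: MapReduce materializes each stage's intermediate output on disk, while Spark chains transformations in memory. The toy pipeline below (plain Python, not the real frameworks' APIs) contrasts the two styles on the same two-stage job, squaring numbers and then summing them.

```python
import json
import os
import tempfile

data = [1, 2, 3, 4, 5]

# MapReduce style: write the intermediate result to disk between stages.
tmp = os.path.join(tempfile.gettempdir(), "stage1.json")
with open(tmp, "w") as f:
    json.dump([x * x for x in data], f)     # stage 1: square each number
with open(tmp) as f:
    mr_result = sum(json.load(f))           # stage 2: read back and sum

# Spark style: chain the stages in memory; nothing touches the disk.
spark_result = sum(x * x for x in data)

print(mr_result, spark_result)  # same answer, very different data movement
```

Both approaches compute the same result; the difference is where the intermediate data lives, which is why Spark is typically much faster for iterative, multi-stage workloads.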
How does Hadoop handle big data?
As discussed in the previous article, there are four characteristics of big data (the 4Vs) that we have to deal with: Volume, Velocity, Variety, and Veracity. Hadoop can scale horizontally: we can add ordinary machines or PCs to the Hadoop cluster to easily expand disk space, memory, and computing power.
- V1: Volume – Hadoop handles large volumes of data comfortably because HDFS, described above, supports very large files, and many ecosystem components, such as HBase or Cassandra, can handle data that keeps growing.
- V2: Velocity – Data arrives at an ever-increasing rate, often as streams that must be handled in real time. Hadoop has tools such as Storm and Kafka to manage this.
- V3: Variety – Data comes in many forms. With HDFS, we can store files of any format, and for semi-structured data (such as XML or JSON) we can use the capabilities of a NoSQL database such as Cassandra.
- V4: Veracity – The reliability of the data. Dealing with this requires the skill of a data scientist to manage the data and decide whether it can be trusted.
- V5: Value – The last V, Value, is the benefit obtained from the data, that is, making the data useful to users. We need machine learning or data mining knowledge to generate new insights from the vast amounts of data we collect, and the Hadoop ecosystem provides tools and libraries, such as Mahout, to help build machine learning models.
Why is Hadoop so popular with Big Data?
Low-cost computing – Hadoop is open-source software that we can download for free under the terms of the Apache License 2.0, and it runs on ordinary computers: home PCs or old notebooks can be made into a Hadoop cluster, with no need for dedicated server hardware.
Effective ecosystem – Hadoop provides a suite of powerful ecosystem programs. In addition to being free, it comes with many great helper programs, including Hive, HBase, Pig, Spark, Sqoop, and many others.
Easy to scale – We can add nodes to a Hadoop cluster without spending much time or money.
Hadoop is a software framework designed specifically to deal with big data, so it can meet almost every big data requirement. The ecosystem also offers great tools for developers to use and build on, and it saves money. However, it does not seem suitable for average users: installing and running Hadoop requires fairly deep IT knowledge in system administration and software development to deploy it effectively.