What is HDFS (Hadoop Distributed File System)?
HDFS stands for Hadoop Distributed File System. It is the file system of the Hadoop framework. It was designed to store and manage huge volumes of data in an efficient manner. HDFS has been developed based on the paper published by Google about its file system, known as the Google File System (GFS).
HDFS is a Userspace File System. Traditionally file systems are embedded in the operating system kernel and runs as an operating system process. But HDFS is not embedded in the operating system kernel. It runs as a User process within the process space allocated for user processes, on the operating system process table. On a traditional process system the block size is of 4-8KB whereas in HDFS the default block size is of 64MB.
- HDFS <- GFS
- Userspace File System
- Default Block Size = 64 MB
Advantages of HDFS
- It can be implemented on commodity hardware.
- It is designed for large files of size up to GB/TB.
- It is suitable for streaming data access, that is, data is written once but read multiple times. For example, Log files where the data is written once but read multiple times.
- It performs automatic recovery of file system upon when a fault is detected.
Disadvantages of HDFS
- It is not suitable for files that are small in size.
- It is not suitable for reading data from a random position in a file. It is best suitable for reading data either from beginning or end of the file.
- It does not support writing of data into the files using multiple writers.
Hadoop Daemons Core Component
An understanding of the Hadoop distributed file system Daemons
This HDFS consists of three Daemons which are:-
- Secondary Namenode.
All the nodes work the primary slave architecture.
The Namenode is the master node while the data node is the slave node. Within the HDFS, there is only a single Namenode and multiple Datanodes.
Functionality of Nodes
The Namenode is used for storing the metadata of the HDFS. This metadata keeps track and stores information about all the files in the HDFS. All the information is stored in the RAM. Typically the Namenode occupies around 1 GB of space to store around 1 million files.
The information stored in the RAM is known as file system metadata. This metadata is stored in a file system on a disc.
The Datanodes are responsible for retrieving and storing information as instructed by the Namenode. They periodically report back to the Namenodes about their status and the files they are storing through a heartbeat. The Datanodes stores multiple copies for each file that is present within the Hadoop distributed file system.