What is HDFS (Hadoop Distributed File System)?
HDFS stands for Hadoop Distributed File System. It is the file system of the Hadoop framework, designed to store and manage huge volumes of data efficiently. HDFS was developed based on the paper Google published about its own file system, the Google File System (GFS).
HDFS is a userspace file system. Traditionally, file systems are embedded in the operating system kernel and run as operating system processes. HDFS, by contrast, is not embedded in the kernel; it runs as a user process, within the process space the operating system allocates to user processes. A traditional file system typically uses a block size of 4-8 KB, whereas HDFS uses a default block size of 64 MB.
- HDFS <- GFS
- Userspace File System
- Default Block Size = 64 MB
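The effect of the large block size can be made concrete with a small sketch (illustrative Python, not Hadoop code): the same file needs far fewer blocks, and therefore far fewer pieces of block metadata to track, at 64 MB per block than at a typical 4 KB kernel block size.

```python
import math

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)

GB = 1024 ** 3
file_size = 1 * GB  # a 1 GB file

kernel_blocks = block_count(file_size, 4 * 1024)        # 4 KB blocks
hdfs_blocks = block_count(file_size, 64 * 1024 * 1024)  # 64 MB blocks

print(kernel_blocks)  # 262144 blocks at 4 KB
print(hdfs_blocks)    # 16 blocks at 64 MB
```

Fewer, larger blocks mean less block-tracking overhead per file and long sequential reads, which matches the streaming access pattern HDFS targets.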
Advantages of HDFS
- It can be implemented on commodity hardware.
- It is designed for large files of size up to GB/TB.
- It is suitable for streaming data access, following a write-once, read-many pattern. Log files, which are written once and then read many times for analysis, are a typical example.
- It performs automatic recovery of the file system when a fault is detected.
Disadvantages of HDFS
- It is not suitable for large numbers of small files.
- It is not suitable for reading data from random positions within a file; it is optimized for sequential reads through the file from beginning to end.
- It does not support multiple concurrent writers; a file is written by a single writer at a time.
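The small-files limitation follows from HDFS's design: the master node (the NameNode) keeps the metadata for every file and block in memory. A commonly cited rule of thumb, not an exact figure, is roughly 150 bytes of NameNode heap per metadata object. Under that assumption, a rough back-of-the-envelope sketch shows why millions of small files are costly:

```python
import math

MB = 1024 ** 2
BYTES_PER_OBJECT = 150  # assumed NameNode heap per metadata object (rule of thumb)

def namenode_heap(num_files: int, num_blocks: int) -> int:
    """Approximate NameNode heap consumed by file objects plus block objects."""
    return (num_files + num_blocks) * BYTES_PER_OBJECT

# Scenario A: 10 million 1 MB files, each occupying one block.
a = namenode_heap(10_000_000, 10_000_000)

# Scenario B: the same total data packed into 1,000 large files of 64 MB blocks.
total_bytes = 10_000_000 * MB
b = namenode_heap(1_000, math.ceil(total_bytes / (64 * MB)))

print(a // MB, "MB")  # ~2861 MB of heap for the small files
print(b // MB, "MB")  # ~22 MB of heap for the same data in large files
```

The same volume of data consumes over a hundred times more master-node memory when split into small files, which is why HDFS favors a modest number of large files.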