
Hadoop Interview Questions

Hadoop FAQ's

1. What is the role of the command ‘jps’ in Hadoop?

A: The ‘jps’ (Java Virtual Machine Process Status) command reports the condition of the daemons running in the Hadoop cluster, returning the status of the Namenode, Secondary Namenode, Datanode, Task Tracker, and Job Tracker in its output.

2. In what modes can Hadoop be run?

A: Hadoop may be run in one of three modes: Standalone, Pseudo-distributed, or Fully Distributed.
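For example, switching from standalone to pseudo-distributed mode is typically done through the Hadoop configuration files. The following is a minimal, commonly used sketch of the two relevant settings (the port and replication value shown are the conventional single-node defaults, not requirements):

```xml
<!-- core-site.xml: point the filesystem at a local HDFS instance -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: a single node can hold only one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

In standalone mode these files are left empty and Hadoop runs in a single JVM against the local filesystem; in fully distributed mode `fs.defaultFS` points at the real Namenode host.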

3. What happens to a Namenode without any data?

A: A Namenode without data is excluded from being a part of the Hadoop Cluster.

4. What is meant by ‘Big Data’?

A: Big Data refers to a collection of large and complex data which makes it difficult to work with using standard data processing methods or database management tools.

5. What characteristics identify Big Data?

A: Four features of Big Data or the four V’s are:

Volume: Describes the size of data.

Velocity: Deals with streaming data analysis.

Variety: Describes different types of data.

Veracity: Denotes the data uncertainty.

6. What is the advantage offered by Hadoop?

A: With the ever-increasing usage of digital devices and access to the internet, massive amounts of data keep flowing into systems. This mass of unstructured data cannot be dealt with using traditional procedures. Hadoop is designed to retrieve and process this massive amount of data, stored in different systems in different locations, in a simple, fast, and effective way. It divides the query into small portions and processes them simultaneously.

7. How is Hadoop different from RDBMS?

A: RDBMS or Relational Database Management System is used in transactional systems for storing and archiving data, while Hadoop stores and processes large amounts of data in a distributed file system. In simpler terms, Hadoop takes up Big Data and analyses it later, while RDBMS seeks out individual records from Big Data.

8. What is meant by Fault Tolerance?

A: Fault Tolerance is Hadoop's built-in backup mechanism: a file stored on one node is automatically replicated to two other places (a default replication factor of three). If one or even two of these copies are lost due to technical problems or accidents, a copy is still available.
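The replication idea can be sketched in a few lines of plain Python (a toy illustration, not Hadoop code; node names are made up):

```python
# A block replicated on three nodes; any surviving copy can serve a read.
block = "data-block-7"
replicas = {"node1": block, "node2": block, "node3": block}

def read_block(replicas):
    """Return the block from the first reachable replica, or None."""
    for node, data in replicas.items():
        if data is not None:
            return data
    return None

# Two nodes fail; the block is still readable from the third.
replicas["node1"] = None
replicas["node2"] = None
print(read_block(replicas))  # data-block-7
```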

9. Are calculations on the data also replicated thrice?

A: No, calculations are done only on the data retrieved from one of the copies or nodes, located by the master node.

10. What is meant by Namenode?

A: The Namenode is the master node containing the metadata, and it maintains the other data nodes. Namenode failure jeopardizes the entire operation; hence it is a high-availability machine and not commodity hardware.

11. What is meant by Datanode?

A: Datanodes are slaves to the master node, located on the participating machines, and serve as the actual data storage. They also continuously communicate with the Namenode by sending heartbeats.

12. Why is Hadoop Distributed File System (HDFS) used for large data sets in a single file rather than in multiple files?

A: The Namenode is a high-performance system and is expensive. Many small files generate a large amount of metadata, which can clog up the Namenode. Hence HDFS is preferred for large amounts of data in a single file rather than spread over many small files.

13. What is meant by Job Tracker?

A: The Job Tracker is a daemon running on the Namenode that submits and tracks jobs and assigns tasks to Task Trackers.

14. What is meant by Task Tracker?

A: Task Tracker is the daemon running on datanodes that executes tasks at datanodes and communicates with the Job Tracker via heartbeats.

15. What is meant by heartbeat?

A: As the name indicates, heartbeats are signals sent by Datanodes to the Namenode and by Task Trackers to the Job Tracker, indicating that they are alive and working. If the Namenode or Job Tracker stops receiving heartbeats from a node, the task is reassigned to a different Datanode or Task Tracker respectively.
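The master's side of this protocol amounts to tracking the last-seen timestamp per node and declaring any node dead after a timeout. A minimal sketch (timeout value, node names, and timestamps are all invented for the example):

```python
HEARTBEAT_TIMEOUT = 10  # seconds without a heartbeat before a node is presumed dead

# Last-seen heartbeat timestamp per datanode (toy clock values).
last_heartbeat = {"datanode-1": 100, "datanode-2": 95}

def live_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Nodes whose most recent heartbeat is within the timeout window."""
    return [n for n, t in last_heartbeat.items() if now - t <= timeout]

print(live_nodes(last_heartbeat, now=102))  # ['datanode-1', 'datanode-2']
print(live_nodes(last_heartbeat, now=106))  # ['datanode-1'] -- datanode-2 timed out
```

In real Hadoop the timeout triggers re-replication of the dead node's blocks and reassignment of its tasks, not merely removal from a list.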

16. What is meant by block in HDFS?

A: A block in HDFS is the minimum unit of data that may be read or written. Files are stored in block-sized portions. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x).
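Splitting a file into fixed-size blocks, with a smaller final block for the remainder, can be sketched as follows (block size scaled down from 64 MB for illustration):

```python
BLOCK_SIZE = 4  # bytes here; a stand-in for the 64 MB HDFS default

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into block-sized chunks; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghij")
print(blocks)  # [b'abcd', b'efgh', b'ij'] -- the final block is smaller
```

This mirrors an important HDFS property: a file smaller than the block size occupies only its actual size on disk, not a full block.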

17. How is data indexing performed in HDFS?

A: HDFS data indexing is done by making the last part of data stored in a block indicate the location of the next block of data.

18. What is meant by Rack?

A: A Rack is a physical storage area where a collection of Datanodes in a single location is kept.

19. What is the function of Secondary Namenode?

A: The Secondary Namenode periodically downloads the Namenode's fsimage and edit logs, merges them into an updated fsimage, and returns the result to the Namenode. It is not a standby or substitute for the Namenode.

20. What is the function of the Reducer?

A: The Datanode creates a Key-Value pair of the result after performing some task assigned to it by the Namenode, and returns this intermediate pair to Reducer. The Reducer gathers these Key-Value pairs and combines them to create the final output.
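The map-to-reduce flow described above is the classic word-count pattern. A minimal pure-Python sketch of it (this is an illustration of the concept, not the Hadoop API):

```python
from collections import defaultdict

def mapper(line):
    """Emit an intermediate (key, value) pair per word."""
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    """Gather intermediate pairs and combine values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

intermediate = []
for line in ["big data", "big hadoop"]:
    intermediate.extend(mapper(line))
print(reducer(intermediate))  # {'big': 2, 'data': 1, 'hadoop': 1}
```

In real Hadoop a shuffle phase sits between the two steps, routing all pairs with the same key to the same Reducer.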

21. What is meant by Rack Awareness?

A: Rack Awareness is the system by which the Namenode uses rack definitions to decide block placement, minimizing network traffic between Datanodes on different racks while still spreading replicas across racks for fault tolerance.
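A rack-aware placement rule can be sketched as follows, loosely following the default HDFS policy of one replica on the writer's rack and two on a second rack (the rack and node names are invented, and real HDFS picks nodes randomly rather than taking the first ones):

```python
def place_replicas(racks, writer_rack):
    """racks: {rack_name: [node, ...]}; return three (rack, node) placements."""
    local_nodes = racks[writer_rack]
    remote_rack = next(r for r in racks if r != writer_rack)
    remote_nodes = racks[remote_rack]
    return [
        (writer_rack, local_nodes[0]),   # first replica: writer's own rack
        (remote_rack, remote_nodes[0]),  # second and third: one other rack,
        (remote_rack, remote_nodes[1]),  # so a whole-rack failure loses at most two
    ]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(racks, "rack1"))
# [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]
```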

22. What is the function of Combiner?

A: The Combiner performs a small, local reduction exclusively on the data produced by a single mapper, before that data is sent over the network. The Reducer then receives the Combiner's pre-aggregated output instead of the raw mapper output.
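Continuing the word-count example, a combiner merges one mapper's pairs locally so that fewer pairs cross the network (again a pure-Python illustration):

```python
from collections import Counter

def combine(mapper_output):
    """Local mini-reduce: merge a single mapper's (word, count) pairs."""
    counts = Counter()
    for word, n in mapper_output:
        counts[word] += n
    return list(counts.items())

pairs = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]
print(combine(pairs))  # [('big', 3), ('data', 1)] -- 4 pairs shrunk to 2
```

This works because word-count's reduce function is associative and commutative; a combiner is only safe for operations with those properties.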

23. What is meant by Speculative Execution?

A: Speculative Execution is the process by which, if a node appears to be running slowly, the master node simultaneously launches the same task on another node and accepts the output from whichever node returns it first.
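The "take whichever finishes first" idea can be illustrated with two concurrent attempts at the same task (node names and delays are made up; real Hadoop also kills the losing attempt):

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import time

def task(node, delay):
    time.sleep(delay)  # simulate a slow vs. fast node doing the same work
    return node

with ThreadPoolExecutor(max_workers=2) as pool:
    attempts = [pool.submit(task, "slow-node", 0.5),
                pool.submit(task, "fast-node", 0.01)]
    # Accept the result of whichever attempt completes first.
    done, _ = wait(attempts, return_when=FIRST_COMPLETED)
    winner = done.pop().result()

print(winner)  # fast-node
```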

24. What is meant by Checkpointing?

A: The Namenode's edit log grows with every change made to the cluster, and replaying a large edit log would significantly increase its startup time. Checkpointing is the process by which the edit logs are merged with the fsimage, so that the Namenode can start directly from an up-to-date fsimage. Checkpointing runs periodically, continually folding new edit logs into the fsimage.
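Conceptually, a checkpoint replays the logged operations against the last snapshot to produce a new one (a toy model; real fsimage and edit-log records are binary and far richer):

```python
fsimage = {"/a.txt": "block-1"}            # last on-disk namespace snapshot
edit_log = [("add", "/b.txt", "block-2"),  # operations logged since that snapshot
            ("delete", "/a.txt", None)]

def checkpoint(fsimage, edit_log):
    """Replay the edit log onto the fsimage to produce a new snapshot."""
    image = dict(fsimage)
    for op, path, block in edit_log:
        if op == "add":
            image[path] = block
        elif op == "delete":
            image.pop(path, None)
    return image  # new fsimage; the replayed edit log can now be truncated

print(checkpoint(fsimage, edit_log))  # {'/b.txt': 'block-2'}
```

Starting from the merged image means the Namenode only replays the short log written since the last checkpoint, not the cluster's entire history.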

25. What is meant by Bootstrap?

A: Bootstrap is a front-end, mobile-first web-development framework using CSS, HTML, and JavaScript.