A: The ‘jps’ command (a Java tool shipped with the JDK, not Scala) reports the condition of the daemons running the Hadoop cluster, returning the status of the Namenode, Secondary Namenode, Datanode, Task Tracker, and Job Tracker in its output.
A: Hadoop may be run in one of three modes: Standalone, Pseudo-distributed, or Fully Distributed.
A: A Namenode without data does not exist: if a Namenode is part of a Hadoop cluster, it holds metadata about the files and blocks stored in that cluster.
A: Big Data refers to a collection of large and complex data which makes it difficult to work with using standard data processing methods or database management tools.
A: The four features of Big Data, known as the four V’s, are:
Volume: the size of the data.
Velocity: the speed at which data is generated and must be processed, including streaming data.
Variety: the different types and formats of data.
Veracity: the uncertainty or trustworthiness of the data.
A: With the ever-increasing usage of digital devices and access to the internet, massive amounts of data keep flowing into systems. This mass of unstructured data cannot be handled with traditional procedures. Hadoop has been designed to retrieve and process massive amounts of data stored on different systems in different locations in a simple, fast, and effective way. It divides a query into small portions and processes them in parallel.
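The divide-and-process-in-parallel idea can be sketched in plain Python. This is only an illustration of the MapReduce pattern Hadoop uses, not Hadoop's actual API; all function names here are invented:

```python
from collections import defaultdict

def map_phase(chunk):
    # Each "mapper" emits (word, 1) pairs for its portion of the input.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # The "reducer" sums the counts for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# The query (here, a word count) is split into portions that in a real
# cluster would be processed simultaneously on different nodes.
chunks = ["big data big", "data hadoop"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(intermediate)
```

Each chunk is processed independently, which is what lets Hadoop spread the work across machines.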
A: RDBMS or Relational Database Management System is used in transactional systems for storing and archiving data, while Hadoop stores and processes large amounts of data in a distributed file system. In simpler terms, Hadoop takes up Big Data and analyses it later, while RDBMS seeks out individual records from Big Data.
A: Fault Tolerance is a sort of backup system in Hadoop: a file stored on one node is automatically replicated to two other places (the default replication factor is 3). If one or even two of these copies are lost due to technical problems or accidents, a copy is still available.
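A minimal sketch of this replication idea, assuming a simplified round-robin placement (HDFS's real placement policy is rack-aware and more involved):

```python
def place_replicas(block_id, nodes, replication=3):
    # Simplified placement: the block plus two copies on successive nodes.
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

def readable(replicas, failed):
    # The block survives as long as at least one replica is on a live node.
    return any(node not in failed for node in replicas)

nodes = ["dn1", "dn2", "dn3", "dn4"]
replicas = place_replicas(7, nodes)
# Losing two of the three replica hosts still leaves one copy readable.
still_ok = readable(replicas, failed={replicas[0], replicas[1]})
```

With three replicas, any two simultaneous failures are tolerated; only losing all three hosts makes the block unreadable.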
A: No, calculations are done only on the data retrieved from one of the copies or nodes, located by the master node.
A: The Namenode is the master node containing the metadata, and it maintains the other data nodes. Namenode failure jeopardizes the entire operation; hence it runs on high-availability hardware rather than commodity hardware.
A: Datanodes are slaves to the master node, located on the participating machines, and serve as the actual data storage. They also communicate continuously with the Namenode by sending heartbeats.
A: The Namenode is a high-performance system and is expensive. Many small files generate a large amount of metadata, which can clog up the Namenode. Hence HDFS is preferred for large amounts of data in a single file rather than spread over many small files.
A: The Job Tracker is a daemon running on the Namenode to submit and track jobs and to assign tasks to Task Trackers.
A: Task Tracker is the daemon running on datanodes that executes tasks at datanodes and communicates with the Job Tracker via heartbeats.
A: As the name indicates, heartbeats are signals sent by Datanodes to the Namenode and by Task Trackers to the Job Tracker indicating they are alive and working. If the Namenode or Job Tracker stops receiving heartbeats from a node, the task is reassigned to a different Datanode or Task Tracker respectively.
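The detect-and-reassign logic can be sketched as follows. This is a toy simulation under assumed names; the timeout value and the pick-any-live-node policy are simplifications, not Hadoop's actual defaults:

```python
def live_nodes(last_heartbeat, now, timeout=30.0):
    # Nodes whose last heartbeat is older than the timeout are presumed dead.
    return {node for node, t in last_heartbeat.items() if now - t <= timeout}

def reassign(task_owner, alive):
    # A task whose owner stopped heartbeating is handed to a live node
    # (simplified: pick the first live node by name).
    if task_owner in alive:
        return task_owner
    return sorted(alive)[0]

beats = {"dn1": 100.0, "dn2": 95.0, "dn3": 60.0}
alive = live_nodes(beats, now=110.0)  # dn3 has missed its heartbeats
owner = reassign("dn3", alive)        # its task moves to a live node
```

The key point is that reassignment is driven purely by the absence of heartbeats; a dead node never needs to announce its own failure.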
A: A block in HDFS is the minimum unit of data that may be read or written. Files are stored in block-sized portions. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2 and later).
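The way a file decomposes into fixed-size blocks can be shown with a small sketch (the function name is illustrative):

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    # A file is stored as a sequence of fixed-size blocks; only the
    # final block may be smaller than the block size.
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= sizes[-1]
    return sizes

# A 150 MB file occupies two full 64 MB blocks plus one 22 MB block.
blocks = split_into_blocks(150 * 1024 * 1024)
```

Note that a small final block consumes only its actual size on disk, not a full 64 MB.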
A: In HDFS, indexing is done by having the last part of the data stored in a block indicate the location of the next block of data.
A: A Rack is a physical storage area where a collection of Datanodes in a single location is kept.
A: The Secondary Namenode periodically reads the file-system metadata from the Namenode’s RAM and writes it to the file system, merging the edit logs into the fsimage. It is a checkpointing helper, not a substitute for the Namenode.
A: The map task running on a Datanode produces a Key-Value pair as the result of the work assigned to it, and returns this intermediate pair to the Reducer. The Reducer gathers these Key-Value pairs and combines them to create the final output.
A: Rack Awareness is the system by which the Namenode decides block placement using rack definitions, keeping traffic within a rack where possible so that network traffic between racks is minimized, while replicas still span racks for fault tolerance.
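A simplified sketch of a rack-aware placement decision. This loosely mirrors HDFS's default policy (first replica local, remaining replicas on another rack) but the function and topology map are invented for illustration:

```python
def rack_aware_placement(local_node, topology):
    # First replica on the writing node; the other two on nodes of a
    # different rack, so the write crosses the rack boundary only once
    # and the block still survives the loss of an entire rack.
    local_rack = topology[local_node]
    remote = [n for n, r in topology.items() if r != local_rack]
    return [local_node] + remote[:2]

topology = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}
placement = rack_aware_placement("dn1", topology)
```

Here two of the three replicas share a rack (cheap intra-rack traffic), while the rack split guarantees a whole-rack failure cannot destroy the block.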
A: The Combiner performs a small, local reduction exclusively on the data produced by a Mapper before it is sent over the network. The Reducer then receives this combined output in place of the raw mapper output.
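The payoff of a combiner is fewer pairs crossing the network. A sketch, again with invented names rather than Hadoop's real interfaces:

```python
from collections import Counter

def mapper(text):
    return [(w, 1) for w in text.split()]

def combiner(pairs):
    # Local reduction on one mapper's output: collapse repeated keys
    # before anything is shuffled to the reducer.
    return list(Counter(k for k, _ in pairs).items())

def reducer(all_pairs):
    totals = Counter()
    for k, v in all_pairs:
        totals[k] += v
    return dict(totals)

raw = mapper("big data big data big")   # 5 pairs
combined = combiner(raw)                # 2 pairs cross the "network"
final = reducer(combined)
```

The combiner is an optimization, not a correctness requirement: the reducer produces the same totals with or without it, which is why it only suits operations like sums and counts.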
A: Speculative Execution is the process by which if a node seems to work slowly, the master node simultaneously executes the same task in another node and accepts the output from whichever node returns it first.
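The race-two-copies idea can be simulated with threads. The node names and delays are made up; real Hadoop decides when to launch a backup task from task progress reports:

```python
import concurrent.futures
import time

def run_on_node(name, delay, task_input):
    # Simulates the same task running on two nodes at different speeds.
    time.sleep(delay)
    return name, task_input * 2

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(run_on_node, "slow-node", 0.2, 21),
               pool.submit(run_on_node, "backup-node", 0.01, 21)]
    # Accept whichever copy finishes first; the other result is discarded.
    winner, result = next(concurrent.futures.as_completed(futures)).result()
```

Both copies compute the same answer, so discarding the loser is safe; the only cost of speculation is the duplicated work.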
A: The Namenode creates progressively larger edit logs with each file added to the cluster, which significantly increases its startup time, since the logs must be replayed. Checkpointing is the process by which the edit logs are combined into a new fsimage, allowing the Namenode to start directly from that image. Subsequent edit logs are periodically merged into the fsimage by further checkpoints.
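The merge step can be sketched as folding a log of operations into a snapshot. The operation names and dictionary-of-paths structure are illustrative only, not the real fsimage format:

```python
def apply_edits(fsimage, edit_log):
    # Checkpointing replays the edit log once, producing a new fsimage
    # so that a restarting Namenode loads the image instead of
    # replaying a long log from scratch.
    image = dict(fsimage)
    for op, path in edit_log:
        if op == "create":
            image[path] = {}
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/a": {}}
edits = [("create", "/b"), ("create", "/c"), ("delete", "/a")]
checkpoint = apply_edits(fsimage, edits)
# After the checkpoint, the merged image stands alone and the old
# edit log can be truncated.
```

Startup cost then depends on the image size rather than on the full history of operations since the cluster began.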