
Apache Spark Interview Questions and Answers

Updated Feb 22, 2019

What is Apache Spark?

Apache Spark is an open source data processing engine used for Big Data applications. It improves on the Hadoop MapReduce model and is used to increase the processing speed of applications through in-memory cluster computing.

What are some important features of Apache Spark?

The main features of Apache Spark are as follows:

  • Analytics – The engine supports machine learning, graph algorithms, SQL queries and other artificial-intelligence workloads.
  • Fast – Applications can run on a Hadoop cluster very quickly, because the data needed for processing is kept in memory.
  • Multiple language support – It supports several languages, such as Python, Java, Scala and R.
  • Real-time data streaming – It supports real-time streaming and manipulation of data using Spark Streaming.

What are the advantages of Apache Spark?

The advantages of Spark are given below:

  • The engine supports machine learning, graph algorithms, SQL queries and other artificial-intelligence workloads.
  • Applications run very fast on a Hadoop cluster, because the data needed for processing is cached in memory.
  • It supports multiple languages, such as Python, Java, Scala and R.
  • It supports real-time streaming of data and data manipulation using Spark Streaming.

What are the disadvantages of Apache Spark?

The disadvantages of Spark are given below:

  • It can be expensive to run, as the memory requirements are high.
  • Spark does not have its own file management system and depends on external file systems such as HDFS.
  • Jobs often have to be optimized manually, which can be a difficult task.
  • Processing can slow down when intermediate results are reused iteratively, since this iterative execution can be very time-consuming.
  • Spark Streaming works on micro-batches, so its latency is higher than that of record-at-a-time streaming engines.

What languages does Apache Spark support? And which one is the most popular?

Apache Spark supports Python, Java, R, Scala and SQL. Scala, the language Spark itself is written in, has traditionally been the firm favourite among developers, with Python a close second.

How to do debugging in Apache Spark?

Apache Spark can be debugged by running code interactively in the spark-shell and inspecting variables, or by setting breakpoints in an IDE such as IntelliJ IDEA while running the application in local mode.
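A minimal sketch of interactive debugging in the spark-shell, assuming Spark is installed locally; the data and log level shown here are purely illustrative:

// Start an interactive shell on the local machine with two worker threads:
//   spark-shell --master local[2]

sc.setLogLevel("INFO")                        // sc is the SparkContext provided by spark-shell
val data = sc.parallelize(Seq(1, 2, 3, 4))    // small in-memory test dataset
val doubled = data.map(_ * 2)                 // transformation under inspection
doubled.collect().foreach(println)            // collect() pulls the results back to the driver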

What is an Apache Spark cluster?

Applications developed using Spark run as independent sets of processes on a cluster, coordinated by a cluster manager. Each application consists of a driver process and a set of executor processes, which are distributed across the cluster.
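A rough sketch of how an application attaches to a cluster, assuming a standalone cluster manager at the illustrative address spark://master-host:7077; the application name is also hypothetical:

import org.apache.spark.sql.SparkSession

// The driver process is created here; the cluster manager launches executors on the worker nodes
val spark = SparkSession.builder()
  .appName("ClusterExample")               // hypothetical application name
  .master("spark://master-host:7077")      // illustrative standalone cluster manager URL
  .getOrCreate()

println(spark.sparkContext.master)         // shows which cluster manager the application is bound to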

Difference between Spark and Hadoop?

The differences between Spark and Hadoop are as follows:

  • Spark is an open source framework designed for fast computation; Hadoop is a framework used for both the storage and processing of big data.
  • Spark is a data analytics engine that also supports artificial intelligence; Hadoop MapReduce is a basic data processing engine.
  • Spark has low latency; Hadoop has high latency while computing.
  • Spark can process data in real time; Hadoop cannot process data in real time and performs batch processing of information.
  • Spark is expensive compared to Hadoop; Hadoop is less costly.
  • Spark's data security is low compared to Hadoop; Hadoop's data security is high.

Why is Apache Spark faster than MapReduce?

Apache Spark is faster than MapReduce because the data needed for processing is kept in memory. MapReduce, by contrast, persists the entire dataset back to the Hadoop file system after every map or reduce step. Spark's in-memory caching avoids these disk round trips, which allows the data to be processed much faster.
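A minimal sketch of in-memory caching, assuming an input file at an illustrative HDFS path; cache() keeps the filtered RDD in memory so the second action does not re-read the file from disk:

// Hypothetical input path used for illustration
val logs = sc.textFile("hdfs:///data/logs.txt")

// Mark the RDD for in-memory caching; it is materialised on the first action
val errors = logs.filter(_.contains("ERROR")).cache()

println(errors.count())                                  // first action: reads from disk, then caches
println(errors.filter(_.contains("timeout")).count())    // reuses the cached data in memory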

How will you select the storage level in Apache Spark?

The storage level in Spark can be selected in the following ways (a short example follows this list):

  • The MEMORY_ONLY_SER option can be used when the RDD objects do not fit in memory as deserialized objects; storing them serialized saves memory space at the cost of extra CPU.
  • MEMORY_ONLY can be used if the RDD fits comfortably in the system memory.
  • Finally, if the RDD cannot fit in memory and there is a large gap between its size and the available memory, the MEMORY_AND_DISK option can be used so that partitions which do not fit are spilled to disk.
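A minimal sketch of setting a storage level explicitly; the RDD contents are illustrative:

import org.apache.spark.storage.StorageLevel

val records = sc.parallelize(1 to 1000000)

// Serialized in-memory storage: smaller footprint, slightly more CPU per access
records.persist(StorageLevel.MEMORY_ONLY_SER)

// For an RDD larger than the available memory, spill what does not fit to disk instead:
// records.persist(StorageLevel.MEMORY_AND_DISK)

println(records.count())                     // the first action materialises the persisted RDD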

What are the libraries in Apache Spark?

The different libraries in Apache Spark are MLlib (the machine learning library), Spark SQL, Spark Streaming and GraphX. Third-party packages also add support for languages such as C#, Clojure and Julia.

What are different persistence levels in Apache Spark?

The different persistence levels in Spark are:

  • MEMORY_ONLY
  • MEMORY_AND_DISK
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK_SER
  • DISK_ONLY

How to check Apache Spark version?

To check the Spark version, run either of the following commands from the command line:

spark-submit --version
spark-shell --version
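From inside an already running shell or application, the version can also be read from the SparkSession or SparkContext:

println(spark.version)    // version reported by the SparkSession
println(sc.version)       // equivalent check through the SparkContext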

What is RDD in Apache Spark?

RDD (Resilient Distributed Dataset) is the core data structure in Spark. The data in an RDD is divided into logical partitions that are distributed across the cluster. An RDD can be thought of as an immutable collection of objects: it cannot be altered once created, and new RDDs are derived from it using operations such as filter, map and group, which are applied to all the data elements in the dataset. RDDs can consist of objects created in Scala, Python or Java.

How to create RDDs in Spark?

RDDs can be created using the following methods (a short sketch follows this list):

  • An RDD can be created from an existing RDD through a transformation. A transformation works like a function: it takes an RDD as the input parameter and produces a new RDD.
  • The parallelize() method can also be used to create RDDs. An existing collection inside the driver program is passed as a parameter to SparkContext.parallelize(), which distributes it across the cluster.
  • RDDs can be created by referencing a dataset in external storage such as HDFS or a shared file system, for example with SparkContext.textFile().
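A minimal sketch of all three approaches; the file path is illustrative:

// 1. From a collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. By transformation of an existing RDD
val squares = numbers.map(n => n * n)

// 3. By referencing an external dataset (hypothetical HDFS path)
val lines = sc.textFile("hdfs:///data/input.txt")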

How to remove a special character from a record in Apache Spark RDD?

Special characters can be removed from records in Spark using user-defined functions. For DataFrame columns, the following code can be used:

import org.apache.spark.sql.functions.udf

// Strip '_' and '#' from a string, then wrap it as a Spark SQL UDF
def remove_string: String => String = _.replaceAll("[_#]", "")
def remove_string_udf = udf(remove_string)
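For a plain RDD of strings, the same regular expression can be applied directly with a map transformation; a minimal sketch with illustrative data:

val raw = sc.parallelize(Seq("abc_1#", "d_ef#2"))
val cleaned = raw.map(_.replaceAll("[_#]", ""))    // removes '_' and '#' from every record
cleaned.collect().foreach(println)                 // prints abc1 and def2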

What is Apache Spark Graphx?

Apache Spark GraphX is the Spark library for graph processing and graph-parallel computation. It is helpful for exploratory analysis and for developing custom code that works with one or more graphs, and it can also be used to join RDDs with graphs. The library has many pre-existing algorithms for graph operations, such as PageRank, Triangle Count and SVD++.
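A minimal sketch of building a graph and running PageRank; the vertex and edge data are illustrative:

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (id, property) pairs; edges carry a relationship label
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)

// Run PageRank until the scores converge within the given tolerance
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)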

Explain what YARN is in Apache Spark.

YARN (Yet Another Resource Negotiator) is the cluster resource management technology of the Hadoop ecosystem. Both Hadoop MapReduce and Spark applications can be run on YARN. As a resource manager it allocates and configures resources across all frameworks running on the cluster, and it is also used for scheduling Spark jobs.
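A minimal sketch of submitting an application to a YARN cluster; the jar name, main class and resource sizes are illustrative:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --class com.example.MyApp \
  my-app.jar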

How to store tweets in Apache Spark?

Tweets from Twitter can be stored in real time using the Spark Streaming library. It can pull live data streams from sources such as Twitter and Kafka and pass them into functions for processing. The live data stream is received by Spark Streaming and split into batches, and these batches are then processed by the Spark engine. The TwitterUtils class from the spark-streaming-twitter package is used for working with tweets.
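A rough sketch, assuming the spark-streaming-twitter connector (now maintained in Apache Bahir) is on the classpath and the Twitter OAuth credentials are already set as twitter4j system properties; the output path is illustrative:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Split the live stream into 10-second batches
val ssc = new StreamingContext(sc, Seconds(10))

// None means the default OAuth credentials from the twitter4j system properties are used
val tweets = TwitterUtils.createStream(ssc, None)

// Store the text of every tweet in each batch (illustrative output path)
tweets.map(_.getText).saveAsTextFiles("hdfs:///data/tweets")

ssc.start()
ssc.awaitTermination()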

Difference between Apache Ignite and Spark.

The differences between Apache Ignite and Spark are given below:

  • Spark is an open source data processing engine used for Big Data applications; Apache Ignite is an open source data processing platform that also works as a distributed database.
  • Spark stores no data during processing and pulls it from other sources; Ignite stores data in a key-value store and supports query processing.
  • Spark supports data-driven payloads, as it is primarily based on RDDs (Resilient Distributed Datasets); Ignite supports computational payloads.
  • Spark deals with read-only data, since RDDs are immutable; Ignite supports both read and write operations.

What does the executor do in Apache Spark?

Executors are worker processes launched for each Spark application; every application gets its own executors, whose lifetime is that of the application. Executors run the application's tasks and read and write data from external sources. The number of executors can also be allocated dynamically at runtime.
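A minimal sketch of executor-related settings; the values are illustrative, and dynamic allocation additionally requires the shuffle-service or shuffle-tracking prerequisites of the Spark version in use:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ExecutorConfigExample")                    // hypothetical application name
  .config("spark.executor.memory", "2g")               // memory per executor
  .config("spark.executor.cores", "2")                 // CPU cores per executor
  .config("spark.dynamicAllocation.enabled", "true")   // let Spark grow and shrink the executor pool
  .getOrCreate()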

What is the difference between databricks and Apache Spark?

Apache Spark is an open source data processing engine used for Big Data applications. It improves on the Hadoop MapReduce model and is used to increase the processing speed of applications.

Databricks is a platform used for data analytics that provides smooth collaboration between data analysts, data scientists and business analysts. It includes all the technologies of Apache Spark, such as MLlib, Spark Streaming, Spark SQL and GraphX.

