What is ETL testing?
ETL stands for extract, transform and load. ETL testing is the process of ensuring that data is loaded properly from a source to a destination; it is also called table balancing. Its main purpose is to verify the data being processed and to find errors in it.
What are the ETL testing concepts?
ETL testing is used to find errors in the data before it gets processed for analytics.
The common functions of ETL testing are as follows:
- Mapping of the data sources and destination
- Analysis of the data model to be used
- Verification of the data at the source
- Verification of the quality and integrity of the data
- Verification that the relationships between tables are preserved during data transfer (a query sketch follows this list)
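For instance, the relationship check in the last point can be expressed as a query that looks for orphaned records. A minimal sketch, where the ORDERS, CUSTOMERS and CUSTOMER_ID names are assumed purely for illustration:

-- Find rows whose foreign key has no matching parent row;
-- an empty result means the relationship survived the transfer.
SELECT o.ORDER_ID, o.CUSTOMER_ID
FROM ORDERS o
LEFT JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.CUSTOMER_ID
WHERE c.CUSTOMER_ID IS NULL;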
How to perform ETL testing?
The ETL testing process is carried out by the following steps:
- Identifying the client's requirements – The user requirements, business flow and the data model to be used are identified here to meet the user's demands.
- Validating the requirements – The data is validated to remove duplicate or redundant records. The data types of the tables and columns must match the data model specifications.
- Designing the test cases – The test cases are designed here, along with the SQL scripts and mapping scenarios (a sketch follows this list).
- Checking the cases – The test cases are evaluated before execution.
- Running the test cases
- Closure of the entire test process
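As a sketch of what the test-case design step produces, the following duplicate check flags any record loaded more than once into the target; the T_CUSTOMER and CUSTOMER_ID names are placeholders:

-- Any row returned indicates a duplicate record in the target table.
SELECT CUSTOMER_ID, COUNT(*) AS OCCURRENCES
FROM T_CUSTOMER
GROUP BY CUSTOMER_ID
HAVING COUNT(*) > 1;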
What is the current ETL testing process?
The current ETL testing processes are:
- New data warehouse testing – Here, the data in the source and destination databases of a newly built warehouse is checked using ETL tools.
- Migration testing – Here, the data is moved from the old source database to a new destination database to improve the efficiency and quality of the data.
- Report testing – The data warehouse reports are tested here to verify that the data and calculations they present are correct.
- Alter application – Here, data pulled from different databases is added to the existing data warehouse.
What is the transformation in ETL testing?
ETL transformation is not a single, simple step, so it is carried out using multiple SQL queries. The process is performed to cleanse the data before it is loaded.
Two ways to approach data transformation in ETL testing are:
- Multistage data transformation – This is the basic ETL process, where the data is moved to a staging area in which the transformation operations occur; it is then transferred to the warehouse (see the sketch after this list).
- In-warehouse data transformation – This is the process where data is extracted and loaded into the analytics warehouse, and the transformation operations are performed there.
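As an illustration of the multistage approach, data might first be cleansed in a staging table and only then moved into the warehouse. The STG_SALES and DW_SALES names below are assumed for the example:

-- Stage 1: cleanse in the staging area (drop rows with no key, normalize text).
DELETE FROM STG_SALES WHERE SALE_ID IS NULL;
UPDATE STG_SALES SET REGION = UPPER(TRIM(REGION));

-- Stage 2: load the cleansed rows into the warehouse table.
INSERT INTO DW_SALES (SALE_ID, REGION, AMOUNT)
SELECT SALE_ID, REGION, AMOUNT FROM STG_SALES;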
How to use SQL functions in ETL testing?
SQL functions are used in ETL based on the requirement. For example, if the data between the source and the target system needs to be compared, the COUNT() function is used.
The queries will look like this, where S_TABLE is the source table and T_TABLE the target:
For the source data: SELECT COUNT(*) FROM S_TABLE WHERE ID IS NOT NULL
For the target system: SELECT COUNT(*) FROM S_TABLE a1, T_TABLE b1 WHERE a1.ID = b1.ID AND b1.ID IS NOT NULL
The COUNT() function can also be used for duplicate data check testing.
Additionally, for validating data completeness, the MINUS and INTERSECT set operators can be used, as in the sketch below.
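A minimal completeness check with MINUS could be written as follows (EXCEPT is the equivalent operator in SQL Server and PostgreSQL); any row returned indicates data missing on one side:

-- Rows present in the source but missing from the target.
SELECT ID FROM S_TABLE
MINUS
SELECT ID FROM T_TABLE;

-- Rows present in the target but absent from the source.
SELECT ID FROM T_TABLE
MINUS
SELECT ID FROM S_TABLE;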
How to write test cases for ETL testing?
ETL test cases can be written once certain information is available: the ETL architecture, the database schemas of the source and the target, the mapping document, and the data model that will be used.
They are written using the following steps:
- Identifying the client's requirements – The user requirements, business flow and the data model to be used are identified here to meet the user's demands.
- Validating the requirements – The data is validated to remove duplicate or redundant records. The data types of the tables and columns must match the data model specifications.
- Designing the test cases – The test cases are designed here, along with the SQL scripts and mapping scenarios.
- Checking the cases – The test cases are evaluated before execution.
- Running the test cases
- Closing the entire test process
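A written test case pairs each verification point with the SQL that proves it. A minimal sketch, reusing the placeholder S_TABLE and T_TABLE names from earlier:

Test case ID: TC_01 (hypothetical)
Objective: the record count must match between source and target after the load.
-- Run against the source:
SELECT COUNT(*) FROM S_TABLE;
-- Run against the target:
SELECT COUNT(*) FROM T_TABLE;
Expected result: both queries return the same count.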
What tools are used to analyze and validate data in ETL testing?
The most popular tools used for testing and validation in ETL testing are:
- QuerySurge
- Informatica Data Validation
- Data Centric Testing
- SSISTester
- Talend Open Studio for Data Integration
- GTL QAceGen
- DbFit
- Codoid’s ETL Testing Services
- TestBench
What is a target system in ETL testing?
In the ETL loading process, the data is transferred from the source to the destination, or target, system. The target system can be a data warehouse or a flat file.
Difference between database testing and ETL testing?
The differences between Database testing and ETL testing are given below:
| ETL Testing | Database Testing |
| --- | --- |
| The primary function is the extraction, transfer and loading of data, and the validation of data from the source to the destination system. | It focuses on data integration, and on the correctness, validation and accuracy of the data in the tables. |
| The data has more aggregations and joins and is not normalized. | The data is normalized and has many joins. |
| Data is checked for duplication when it is loaded from one system to another. | Data is checked for missing and null values. |
| The foreign and primary key relationships are preserved during the ETL process. | Data redundancy and database normalization are evaluated. |
| A multidimensional data model is used while testing. | An entity-relationship data model is used. |
What is ETL bridging testing?
Bridging can be described as the process of transferring data from one data warehouse to another. The ETL process creates a bridge between two data systems, hence the name.
What are the dimension and fact tables in ETL testing?
In a data warehouse, a dimension table resides in a star schema. It contains attributes, dimension keys, and values, and it holds the dimensions of a fact. These tables can range in size from a few rows to many thousands of rows.
Using foreign keys, a dimension table can be joined to a fact table. It is usually connected to a fact table and contains descriptive attributes.
A fact table consists of the measures used in a business operation, such as its metrics or facts. It is surrounded by dimension tables and contains sales information such as Product and Price. It connects to each dimension table through a foreign key that references the dimension table's primary key (see the sketch below).
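A minimal star-schema sketch (all table and column names here are illustrative) shows how the fact table joins to a dimension table:

CREATE TABLE DIM_PRODUCT (
    PRODUCT_KEY  INT PRIMARY KEY,       -- dimension key
    PRODUCT_NAME VARCHAR(100),          -- descriptive attribute
    CATEGORY     VARCHAR(50)            -- descriptive attribute
);

CREATE TABLE FACT_SALES (
    SALE_ID     INT PRIMARY KEY,
    PRODUCT_KEY INT REFERENCES DIM_PRODUCT (PRODUCT_KEY),  -- foreign key to the dimension
    QUANTITY    INT,                    -- measure
    PRICE       DECIMAL(10, 2)          -- measure
);

-- Joining the fact to its dimension for analysis.
SELECT d.CATEGORY, SUM(f.QUANTITY * f.PRICE) AS REVENUE
FROM FACT_SALES f
JOIN DIM_PRODUCT d ON f.PRODUCT_KEY = d.PRODUCT_KEY
GROUP BY d.CATEGORY;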
Define partitioning in ETL Testing?
Partitioning is the process of dividing a storage space into subdivisions to manage the data better, to make recovery and backup easier, and to enhance the overall performance of the storage system. The types of partitioning are horizontal partitioning, vertical partitioning, round-robin partitioning and hash partitioning.
What is a fact?
A fact table consists of the measures used in a business operation, such as its metrics or facts. It is surrounded by dimension tables, to which it is connected, and contains sales information such as Product and Price.
What are the different types of facts in ETL?
The different types of facts in ETL are:
- Transaction fact tables – This table consists of events that have already occurred; if a transaction has happened, a row will exist for it.
- Accumulated fact tables – This table is used to depict the activity of a process.
- Snapshot fact tables – This table is used to depict the state of a process at a given point in time (a sketch of the table shapes follows this list).
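As a rough sketch of the difference, a transaction fact table records one row per event, while a snapshot fact table records the state of something at a point in time. The names below are assumed for illustration:

-- Transaction fact: one row per completed sale event.
CREATE TABLE FACT_SALE_TXN (
    SALE_ID   INT,
    SALE_DATE DATE,
    AMOUNT    DECIMAL(10, 2)
);

-- Snapshot fact: one row per account per day, capturing its state.
CREATE TABLE FACT_ACCOUNT_SNAPSHOT (
    ACCOUNT_ID    INT,
    SNAPSHOT_DATE DATE,
    BALANCE       DECIMAL(12, 2)
);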
Differentiate between Fact and Dimension tables?
The differences between Fact and Dimension tables are as follows:
| Fact table | Dimension table |
| --- | --- |
| Consists of the measures used in a business operation, such as its metrics or facts. | Resides in a star schema and contains attributes, dimension keys and values. |
| Contains no hierarchies. | Contains hierarchies (e.g., category → product). |
| Its primary key is typically a composite of foreign keys that reference the dimension tables. | Its primary key acts as a foreign key in the fact table. |
| Contains sales data together with keys to a set of dimensions. | Contains the details of a dimension, such as Product type, Date of production, Customer ID, etc. |
What is the surrogate key?
A surrogate key in ETL testing is a column, or set of columns, that can serve as the primary key as an alternative to the natural key. A table may have several natural keys, called candidate keys, any of which could be chosen as the primary key; a surrogate key is one more such candidate key.
A very commonly used surrogate key is an auto-incrementing integer column, whose value is generated by the system and carries no business meaning.
Explain the difference between the surrogate key and primary key.
A surrogate key in ETL testing is a column, or set of columns, that can serve as the primary key as an alternative to the natural key. A table may have several natural keys, called candidate keys, any of which could be chosen as the primary key; a surrogate key is one more such candidate key.
A primary key is a key that uniquely identifies each row of a table. For example, in a table holding student details, the primary key would be the Student ID, which is unique for each student. A sketch of the two side by side follows.
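A minimal sketch of the distinction, with assumed table and column names (the identity syntax varies by database; this form works in PostgreSQL and Oracle 12c+):

-- STUDENT_KEY is a surrogate key: a system-generated, auto-incrementing
-- integer with no business meaning.
-- STUDENT_ID is the natural key carried over from the source system.
CREATE TABLE DIM_STUDENT (
    STUDENT_KEY  INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    STUDENT_ID   VARCHAR(20) NOT NULL UNIQUE,
    STUDENT_NAME VARCHAR(100)
);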
Define partitioning in ETL?
Partitioning – It is the process of dividing a storage space into subdivisions to manage the data better, to make recovery and backup easier, and to enhance the overall performance of the storage system.
The types of partitioning are:
- Horizontal partitioning
- Vertical partitioning
- Round-robin partitioning
- Hash partitioning
Define hash partitioning and round-robin partitioning in ETL?
Round robin – Here, the data is distributed evenly among the partitions, so that each partition holds roughly the same number of rows. No partition key is used.
Hash partitioning – Here, the server applies a hash function to the partition key to decide which partition each row belongs to; rows with the same partition key are grouped into the same partition. It is used to process groups of rows that share the same partition key, as in the example below.
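As an example of hash partitioning, the PostgreSQL-style DDL below (syntax varies across databases) routes each row to a partition by hashing the partition key:

CREATE TABLE SALES (
    SALE_ID     INT,
    CUSTOMER_ID INT,
    AMOUNT      DECIMAL(10, 2)
) PARTITION BY HASH (CUSTOMER_ID);

-- Four partitions; rows with the same CUSTOMER_ID hash to the same
-- remainder, so they always land in the same partition.
CREATE TABLE SALES_P0 PARTITION OF SALES FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE SALES_P1 PARTITION OF SALES FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE SALES_P2 PARTITION OF SALES FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE SALES_P3 PARTITION OF SALES FOR VALUES WITH (MODULUS 4, REMAINDER 3);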