Data Warehousing (DW) is the process of collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from heterogeneous sources, and it forms the core of a BI system built for data analysis and reporting.
It is a blend of technologies and components that aid the strategic use of data: a process of transforming data into information and making it available to users in time to make a difference.
1. What is real-time data warehousing?
Real-time Data Warehousing refers to systems that reflect the warehouse's status in real-time. When a query is run against the real-time data warehouse to learn about a particular aspect of a business or entity, the answer reflects the state of that entity at the time the query was run. Data warehouses typically have latent data - or data that reflects the business at a past time. Real-time data warehouses provide current data with low latency.
2. What are the testing phases in a data warehousing project?
The following phases are generally checked and optimized during the execution of the data warehousing project:
Performance and scalability: Ensure that data loads and queries perform within expected time frames and that the technical architecture is scalable.
Integration testing: Confirm that the ETL process functions well with other upstream and downstream processes.
User-acceptance testing: Certify that the data warehousing solution meets users' current expectations and anticipates their future needs.
Regression testing: Ensure existing functionality remains intact each time a new release of ETL code and data is deployed.
3. What are the functions of a warehouse manager?
The data warehouse manager is expected to fulfill the following responsibilities:
- Monitor all long-term objectives.
- Train data warehouse staff.
- Ensure appropriate maintenance and development of all data.
- Evaluate staff performance.
- Administer database consolidation.
- Administer all Service Level Agreements.
- Maintain the warehouse across large enterprise deployments.
4. What is a snapshot in the context of Data warehousing?
Data warehouse snapshots can be used to track activities. Whenever a significant event occurs, such as an employee changing their address, the data warehouse can be alerted and a snapshot taken.
A snapshot in this regard has three components –
- The time when the event occurred.
- A key identifying the snapshot.
- Data that relates to the key.
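The three components above can be sketched as a simple record type. This is an illustrative model only; the field names and the address-change example are assumptions, not part of any specific warehouse product:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Snapshot:
    """An event-driven snapshot: when it happened, what it identifies, what it captured."""
    event_time: datetime   # the time when the event occurred
    key: str               # a key identifying the snapshot (e.g. an employee id)
    data: dict             # data that relates to the key (e.g. the new address)

# Example: an address change triggers a snapshot.
snap = Snapshot(
    event_time=datetime(2023, 5, 1, tzinfo=timezone.utc),
    key="emp-1042",
    data={"address": "221B Baker Street"},
)
print(snap.key)  # emp-1042
```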
5. What is the junk dimension?
It is common in data warehouse design to run into yes/no indicator fields in source systems. Such information is often worth keeping, but it creates a modeling problem: if we keep all those indicator fields in the fact table, or give each one its own dimension, we end up creating many dimension tables and storing a tremendous amount of information, resulting in performance and management issues.
The junk dimension is the solution to this problem. We combine these indicator fields into a junk dimension. Thus, we only need to build one dimension table, and the size and number of fields in the fact table can be reduced.
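A minimal sketch of the idea: the junk dimension enumerates every combination of the indicator flags once and assigns each combination a surrogate key, so a fact row stores a single small key instead of several flag columns. The flag names here are hypothetical:

```python
from itertools import product

# Three yes/no indicator flags from a source system (hypothetical names).
flags = ["is_promo", "is_online", "is_gift_wrapped"]

# The junk dimension holds every combination once, keyed by a surrogate key.
junk_dim = {
    combo: surrogate_key
    for surrogate_key, combo in enumerate(product([0, 1], repeat=len(flags)))
}

# A fact row now carries one small key instead of three flag columns.
fact_row_flags = (1, 0, 1)  # is_promo=yes, is_online=no, is_gift_wrapped=yes
junk_key = junk_dim[fact_row_flags]
print(len(junk_dim), junk_key)  # 8 5
```

With three flags the junk dimension has only 2^3 = 8 rows, however many fact rows reference it.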
6. State the difference between materialized view and view.
A view is a stored query, often combining data from different tables; it has no data of its own. Creating a view stores nothing in the database: the result is computed each time a query is fired against the view.
A materialized view, commonly used in data warehousing, does hold data. The query result is computed in advance and physically stored, so decisions can be made and calculations performed directly on it.
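The difference can be demonstrated with Python's built-in sqlite3 module. SQLite has no native materialized views, so the "materialized" copy is simulated here with `CREATE TABLE AS`, which is exactly the point: the stored copy goes stale while the view stays current.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("north", 100.0), ("north", 50.0), ("south", 75.0)])

# A view stores only the query; it is evaluated each time it is read.
con.execute("""CREATE VIEW v_totals AS
               SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")

# Simulated materialized view: the query result is physically stored.
con.execute("""CREATE TABLE mv_totals AS
               SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")

con.execute("INSERT INTO sales VALUES ('north', 25.0)")

# The view reflects the new row; the stored copy is stale until refreshed.
view_total = con.execute(
    "SELECT total FROM v_totals WHERE region = 'north'").fetchone()[0]
mv_total = con.execute(
    "SELECT total FROM mv_totals WHERE region = 'north'").fetchone()[0]
print(view_total, mv_total)  # 175.0 150.0
```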
7. What do you mean by conformed dimensions?
A conformed dimension is a dimension that has the same meaning and contents for every fact table it is associated with. Because it categorizes and describes facts and measures consistently, it ensures consistent reporting across multiple fact tables and/or data marts.
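A small sketch of the idea, with an illustrative schema: one shared date dimension joined to two different fact tables, so "quarter" means the same thing in a sales report and a shipments report.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One conformed date dimension shared by two fact tables (illustrative schema).
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE fact_sales (date_key INTEGER, revenue REAL);
CREATE TABLE fact_shipments (date_key INTEGER, units INTEGER);
INSERT INTO dim_date VALUES (20230101, 2023, 'Q1'), (20230401, 2023, 'Q2');
INSERT INTO fact_sales VALUES (20230101, 500.0), (20230401, 300.0);
INSERT INTO fact_shipments VALUES (20230101, 12), (20230401, 7);
""")

# Both facts join to the same dimension, so the filter "quarter = 'Q1'"
# is guaranteed to mean the same period in both reports.
q1_revenue = con.execute("""SELECT SUM(revenue) FROM fact_sales
                            JOIN dim_date USING (date_key)
                            WHERE quarter = 'Q1'""").fetchone()[0]
q1_units = con.execute("""SELECT SUM(units) FROM fact_shipments
                          JOIN dim_date USING (date_key)
                          WHERE quarter = 'Q1'""").fetchone()[0]
print(q1_revenue, q1_units)  # 500.0 12
```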
8. What would you say is the main difference between a data warehouse system and an operational database?
The Operational Database is the source of information for the data warehouse. It includes detailed information used to run the day-to-day operations of the business. The data changes frequently as updates are made and reflects the current values of the latest transactions.
Operational Database Management Systems also called OLTP (Online Transactions Processing Databases), are used to manage dynamic data in real-time.
Data Warehouse Systems serve users or knowledge workers for the purpose of data analysis and decision-making. Such systems can organize and present information in specific formats to accommodate the diverse needs of various users. These systems are known as Online Analytical Processing (OLAP) systems.
9. Explain a Data cube and its functionalities.
When data is grouped or combined in multidimensional matrices, the result is called a Data Cube. The data cube approach has a few alternative names or variants, such as "multidimensional databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently inquired about.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, like sales and transactions. A fact table represents this theme. Facts are numerical measures. Thus, the fact table contains measures (such as Rs_sold) and keys to each of the related dimensional tables.
Dimensions are the perspectives or entities with respect to which a data cube is defined. Facts are generally numerical measures, used for analyzing the relationships between dimensions.
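The "materialize expensive computations" idea can be sketched in plain Python: for a tiny fact table, precompute the totals grouped by every subset of the dimensions (every cuboid of the cube). The data and column names here are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Fact rows: (product, region, year, amount_sold) — illustrative data.
facts = [
    ("laptop", "north", 2023, 10),
    ("laptop", "south", 2023, 5),
    ("phone",  "north", 2024, 8),
]
dims = ("product", "region", "year")

# Materialize every cuboid: totals grouped by each subset of dimensions.
cube = {}
for k in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), k):
        totals = defaultdict(int)
        for row in facts:
            key = tuple(row[i] for i in subset)
            totals[key] += row[3]
        cube[tuple(dims[i] for i in subset)] = dict(totals)

print(cube[("region",)])  # {('north',): 18, ('south',): 5}
print(cube[()])           # {(): 23} -- the grand total (apex cuboid)
```

Queries against any dimension combination now read a precomputed total instead of rescanning the fact rows.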
10. Define Data marting and its features.
A data mart is a structure, specific to data warehouse environments, used to retrieve data for a particular business domain or team. An organization typically has several data marts, each a logical subset of the data warehouse. Data marts come in three types: dependent, independent, and hybrid. Dependent data marts draw on data already present in the central data warehouse, independent data marts draw data directly from operational or external sources, and hybrid data marts combine both approaches.
11. How is Metadata different from a Data dictionary?
Metadata describes data. It has information about how and when, by whom certain data was collected, and the data format. This data of data or metadata is essential to understand information that is stored in data warehouses and XML-based web applications.
A data dictionary, on the other hand, contains the basic definitions of a database: a list of the files in the database, the number of records in each file, and information about the fields.
12. What is ETL, and what is ETL testing?
In ETL, data is extracted from source systems, transformed into a consistent data type, and then loaded into a single repository. During ETL testing, data is validated, verified, and qualified to prevent duplicate records and data loss.
ETL testing ensures that data is transferred from heterogeneous sources to the central data warehouse in compliance with all transformation rules and validation checks. Unlike the data reconciliation used in database testing, ETL testing is applied to data warehouse systems to obtain relevant information for analytics and business intelligence.
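A minimal extract-transform-load sketch with two typical ETL test checks, row-count reconciliation and duplicate detection. The source records and field names are invented for illustration:

```python
# Extract: raw records from a hypothetical source system (all strings).
source = [
    {"id": "1", "amount": "10.50"},
    {"id": "2", "amount": "7.25"},
    {"id": "2", "amount": "7.25"},   # duplicate record from the source
]

def transform(row):
    # Cast to consistent data types, as the transform step would.
    return {"id": int(row["id"]), "amount": float(row["amount"])}

# Load into the target, deduplicating on the business key.
target = {}
for row in map(transform, source):
    target[row["id"]] = row

# ETL tests: no duplicate keys leaked through, and the totals reconcile.
assert len(target) == len({r["id"] for r in source})
assert sum(r["amount"] for r in target.values()) == 17.75
print(len(target))  # 2
```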
13. Explain the data warehouse distribution process
The data warehouse distribution process involves the following subsets of processes:
Step 1: Determine Business Objectives
Learn the requirements and objectives of the business and its owners and stakeholders and convert those requirements into quantifiable form. These quantifiable key performance indicators will then be your business objectives.
Step 2: Collect and Analyze Information
Ask questions, acquire various sales, CRM, and marketing reports (as needed), and then consolidate all this information into a structured format.
Step 3: Understand core business processes
Once you have the data, you will be able to better understand the business processes and come up with the core objectives you need to solve with data warehousing.
Step 4: Initiate a conceptual data model
Create a conceptual model of the data after identifying the business processes. The next goal will be to determine the subjects expressed as fact tables and dimensions relating to the facts.
Step 5: Locate data sources and plan data transformations
Analyze core data sources and set to work on building the data pipelines and transformation models.
Step 6: Set tracking duration
To process large amounts of data, you need to optimize the time requirements. Since data warehouses track performance over time, the data should remain available virtually forever, and the solution should account for that.
Step 7: Start final implementation
Once the plan is developed, start allocating resources and scheduling tasks to get things in the pipeline.
14. What is normalization?
Normalization, also referred to as database normalization, is the process of organizing the columns and tables of a relational database. This reduces data redundancy and improves data integrity.
Moreover, the process simplifies the database design so that an optimal structure can be achieved. In short, it splits larger tables into smaller, related tables, improving data consistency while keeping retrieval straightforward.
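A tiny sketch of the splitting step, with made-up columns: a denormalized table repeats the customer's city on every order, and normalizing moves customer attributes into their own table keyed by `customer_id`.

```python
# Denormalized: customer details are repeated on every order row.
denormalized = [
    {"order_id": 1, "customer_id": 10, "customer_city": "Pune",   "total": 40.0},
    {"order_id": 2, "customer_id": 10, "customer_city": "Pune",   "total": 15.0},
    {"order_id": 3, "customer_id": 11, "customer_city": "Mumbai", "total": 22.0},
]

# Customer attributes move into their own table, keyed by customer_id...
customers = {row["customer_id"]: {"city": row["customer_city"]}
             for row in denormalized}

# ...and orders keep only a foreign key to the customer.
orders = [{"order_id": r["order_id"], "customer_id": r["customer_id"],
           "total": r["total"]} for r in denormalized]

# The city for customer 10 is now stored once instead of twice.
print(len(customers), customers[10]["city"])  # 2 Pune
```

Updating customer 10's city now touches one row rather than every order that customer ever placed.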
15. What is a fact table? Explain how many fact tables there are in a star schema.
Fact tables contain the measurements, facts, and metrics of business processes. A fact table usually sits at the center of a star schema, or of its normalized variant, the snowflake schema. A fact table typically consists of two types of columns:
- Fact data
- Foreign key relations
A star schema (or snowflake schema) stores only one fact table. When multiple fact tables share dimensions, the result is a fact constellation (galaxy) schema.
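The two column types can be seen in a minimal star schema, built here with Python's sqlite3 module. The table and column names are illustrative: `amount_sold` is the fact data, and the `*_key` columns are the foreign-key relations to the dimension tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A single fact table at the center of a star, with one measure and
# foreign keys to its dimension tables (illustrative schema).
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product,  -- foreign key relation
    store_key   INTEGER REFERENCES dim_store,    -- foreign key relation
    amount_sold REAL                             -- fact data
);
INSERT INTO dim_product VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO dim_store   VALUES (1, 'Pune');
INSERT INTO fact_sales  VALUES (1, 1, 900.0), (2, 1, 300.0), (1, 1, 950.0);
""")

# A typical star-schema query: join the fact to a dimension and aggregate.
rows = con.execute("""SELECT p.name, SUM(f.amount_sold)
                      FROM fact_sales f JOIN dim_product p USING (product_key)
                      GROUP BY p.name ORDER BY p.name""").fetchall()
print(rows)  # [('laptop', 1850.0), ('phone', 300.0)]
```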