This article is mainly to record some of my summary about data in my first job.
Why we need the data warehouse
When the scale of data is relatively small, database is a general way to store and read the data. Common relational databases include MySQL, SQL Server and PostgreSQL. However, as the data grows, there will be some bottlenecks when using the database.
- As the volume of data increases, the efficiency of query is gradually slow.
- There are too many tables when executing the query, and due to the constraint for the normal form, in many queries we need to do at least one time join operation, which is time-consuming.
- There may be exist some illegal fields in the table, which would affect the query result
When meeting these bottlenecks, we can consider to use data warehouse to solve these problems. In my first job I got in touch with offline data warehouse, so ‘data warehouse’ means the offline data warehouse if not have other specific explanations.
One of the difference between offline data warehouse and database is that database should support the real time CRUD operation simultaneously, other than offline data warehouse mainly focus on the data reading. Because data warehouse mainly stores the historical data, in real environment, the historical data would be written into data warehouse one or two times, or several days one time. It causes that data warehouse doesn’t have too much requirement for the timeliness. (In general the data are stored in data warehouse in the early morning and can use the tools like Airflow to do the schedule) Therefore, the design for the data warehouse tends to the efficiency of reading data.
For instance, at that time we use Parquet as the storage format in data warehouse. Parquet is column-oriented, unlike row-oriented for database. When doing the query to read specific columns, Parquet only needs to load these columns other than all columns in the table, which can save the time for loading. On the contrary, for the data writing, column-oriented storage will spend more time because when writing data it needs to modify multiple columns.
In addition, the tables in data warehouse are allowed to be redundant and don’t need to consider the constraint of normal forms. So data warehouse can reduce the frequency of join operation when doing the query, reduce the time for join and increase the efficiency for reading. Because update operation can rarely happen in data warehouse, it is not necessary for us to worry that the value update for one column would influence other tables.
In summary, database can do many things that you would like to do, but the efficiency for it is not good in large dataset. And data warehouse is specialized in the data reading, so it has the higher efficiency when reading.
The whole process that data stores in data warehouse
Database focus on OLTP(Online Transaction Process), and data warehouse focus on OLAP(Online Analytical Process). Therefore, data warehouse is to read the historical data and do the related analysis. It has several layers (from button to top), STG, ODS, DWD, DWS, ADS, and the DIM layer which covers multiple data warehouse layers.
1. Read data from source (ODS/STG)
ODS or STG+ODS layer is to receive the data from data source and this layer is the entry for the data warehouse. The source of data includes several ways,
(1)The data which are generated in the application. (User action log), such as the link that user accessed in the application, or the result when user finishs playing a game. These data will be collected and sent to server predically or be triggered by some user actions. Due to the large amount of data in this aspect, the most parts of them are not stored in database and directly ready for entering the data warehouse.
(2)The data related to database, which can be read by connecting the database.
(3)External data (such as the data in other websites, or the data got by web script)
In common, the data are not be processed in this stage, or only done some data format transformation operation. To avoid influencing the daily usage of server, the operation that writing data into data warehouse usually happens in the early morning.
2. Raw Data Process(DWD)
DWD is data warehouse detail layer, and this layer mainly stores the data which have been transformed, and keeps the detailed information of them. For the transformation, I think that it can be divided to two parts, one is the transformation on the table layer, and other is for every row of data.
We can classify data into different data marts according to the business line that the data belong to.
2.1 The transformation on the table layer
This part of transformation is to solve the problem that there are too many raw tables in database and the query is complicated. For the same business object, the tables may be divided to multiple similar tables with different suffixes based on different regions, like tablename_sg, tablename_my, tablename_cn and so on. In this situation if analyst needs to make a query to get the information in all regions, without data warehouse he or she may need to do the large amount of union operation. Therefore, it is necessary to do the transformation on the table layer.
One method for transformation is to combine multiple tables into one table, and add a column to explain the raw table this row data belong to.
And the other situation is that analyst needs to know the data for one business object but there are not full data in the raw table. And in this scene the transformation is to join more than one table (can be raw table or other tables have exists in data warehouse) to generate a new table on DWD layer based on the requirement for the user.
2.2 The transformation on every row data
This kind of transformation covers these aspects,
- Check if the data are legal, if not legal and need to modify or delete this row data
- The transformation based on the mapping
(1)In raw table one column uses integer to store specific information (like 1 means male, and 2 means female) , and to make the analysis more convenient, we can do the transformation from integer to string based on the specific mapping.
(2)The transformation between integer and integer, such as the exchange for money.
2.3 The way to make sure whether the process is correct
My experience is as follows,
(1) Do the unit test when developing, and see if test output satisfies what we want
(2) In DQC(Data quality check), check whether the table shows the content what we hope (such as the new column or the transformation in DWD layer)
(3) Do the count(*) in both raw table and DWD table in the same specific condition, and observe whether the num is equal. If not equal, it means the transformation is inconsistent and needs to be checked where the root cause is.
3. Summary the data(DWS)
Compared with the detailed data, in many times our analysts are more interested in the summary data, such as the total players in one day for a game, or the rate between male players and female players. So in this situation we can use the Data Warehouse Layer(DWS). This layer is to summary the data in the DWD table.
The common operations in DWS table include the aggregation operation like sum, count, avg in SQL, and the join operation between different tables in data mart or the whole data warehouse.
The above is some of my understanding and summary about data and the process that data enter data warehouse. Data warehouse can solve the bottleneck for database when the data keep growing, and make the query and analysis more convenient, which can improve the whole efficiency on a company.