SQL Server商业智能–在已建立的数据仓库中扩展事实表

最新推荐文章于 2023-09-06 12:08:18 发布

翻译最新推荐文章于 2023-09-06 12:08:18 发布 · 406 阅读

1 ·

本内容遵循CC 4.0 BY-SA版权协议

原文链接：https://www.sqlshack.com/sql-server-business-intelligence-expanding-fact-tables-established-data-warehouse/

本文介绍了一种在现有数据仓库事实表中安全添加新列的方法，通过条件检查避免重复更新，确保数据一致性。此方法适用于OLTP和数据仓库不在同一服务器的情况。

介绍 (Introduction)

在生活中，听起来简单的事情并非总是最简单的事情。在计算机科学中，情况更是如此。实际上，信息技术工作中最具挑战性的事情之一通常是与老板之间的沟通，这些老板对技术知之甚少，并且需要为看似简单的任务花费的时间辩护。但是，出于同样的原因，未经训练的人似乎不可能完成的任务通常非常简单，易于实施，并且可以使您容易受到尊重。

扩展事实表 (Expanding fact tables)

Adding a column to an existing fact table seems like a fairly simple task. It is also often a legitimate business requirement. For example, if an organization wants to investigate the utility of installing new and more expensive credit card facilities, decision makers would probably need to be able to analyze payment types on their sales fact. To do so, provided the information exists in the OnLine Transaction Processing (OLTP / Production) database one would simply need to add a “payment type” dimension and a “PaymentTypeKey” column to the fact table (in a traditional star schema model).

在现有事实表中添加列似乎是一项相当简单的任务。这通常也是合法的业务要求。例如，如果组织希望调查安装新的和更昂贵的信用卡设施的实用性，则决策者可能需要能够根据其销售事实来分析付款类型。为此，只要在线交易处理（OLTP /生产）数据库中存在该信息，则只需在事实表中添加“付款类型”维和“ PaymentTypeKey”列（在传统的星型模式模型中）。

This schema change is fairly simple but if, as is that case in many systems, your data warehouse and OLTP database are not on the same server/instance or your solution needs to be deployable you will not be able to update the table and fill the new column with data directly from your OLTP source in one T-SQL statement. Likewise, due to the differential nature of ETL updates even once you have added a new look up and fed the new column only the newest rows will have this column filled.

模式更改非常简单，但是，如果像在许多系统中那样，您的数据仓库和OLTP数据库不在同一服务器/实例上，或者您的解决方案需要可部署，则将无法更新表并填充表。在一个T-SQL语句中直接包含来自OLTP源的数据的新列。同样，由于ETL更新的差异性，即使您添加了新的查询并输入了新的列，也只有最新的行会填充此列。

Dialog showing that only the newest rows will have the PaymentTypeKey column filled

It is also very often unacceptable to entirely truncate a fact table and ETL log (causing it to rerun entirely) because fact tables can be huge and this would clog up the server.

完全截断一个事实表和ETL日志（使它完全重新运行）通常也是不可接受的，因为事实表可能很大，这会阻塞服务器。

The following method provides a solution that can be tacked on to an existing fact ETL flow. To learn how to create a basic fact ETL flow you can check out an earlier article on the matter here.

以下方法提供了可以附加到现有事实ETL流程的解决方案。要了解如何创建基本事实ETL流，您可以在此处查看有关此问题的早期文章。

Assuming you have already set up views on your OLTP database which reflect the tables in your data warehouse the basic steps are as follows:

假设您已经在OLTP数据库上设置了可反映数据仓库中表的视图，基本步骤如下：

Alter your view on the OLTP database or change the query in the OLE DB Source of your original SSIS dataflow to include the ID of reference table (in our case PaymentTypeID)
更改您在OLTP数据库上的视图或更改原始SSIS数据流的OLE DB源中的查询，以包括引用表的ID（在我们的情况下为PaymentTypeID）
Add a new view on your previously excluded reference table and create the table along with its own ETL in the data warehouse. Or, if you are not using the OLTP view method you can just create a new ETL package and write this query in the OLE DB Source instead of selecting from your view.
在您先前排除的参考表上添加一个新视图，并在数据仓库中创建该表及其自身的ETL。或者，如果您不使用OLTP视图方法，则只需创建一个新的ETL包并在OLE DB源中编写此查询即可，而不是从视图中进行选择。
Alter the original data flow task for your fact table to include the new column (I did an extra lookup too because it’s better practice to include the data warehouse’s surrogate key instead of the OLTP ID)
更改事实表的原始数据流任务以包括新列（我也做了额外的查找，因为更好的做法是包括数据仓库的代理键而不是OLTP ID）
Create a conditional test that checks if the new column has been updated in the past and either reverts to the old dataflow or first updates the existing data.
创建一个条件测试，以检查新列在过去是否已更新，并且恢复到旧数据流还是首先更新现有数据。

The first three points are out of scope for this article but the 4^th is covered in detail below.

第一三点超出范围本文但^第被覆盖在下面详细的4。

创建条件ETL路径 (Creating the conditional ETL path)

It is worth noting that these steps can be used in any conditional split setup and can come in handy in many situations, it is basically the equivalent logic of an IF statement but in ETL format.

值得注意的是，这些步骤可以在任何条件拆分设置中使用，并且在许多情况下都可以派上用场，它基本上是IF语句的等效逻辑，但采用ETL格式。

The following figure shows the new module tacked on to the old existing fact table ETL:

下图显示了附加到旧的现有事实表ETL上的新模块：

Figure showing the new module tacked on to the old existing fact table ETL

What it does is:

它的作用是：

Check if the new column has any value but -1 (undefined).
检查新列是否具有除-1（未定义）之外的任何值。
Create a staging table in the data warehouse existing of the fact table key along with the new PaymentTypeKey.
在事实表键和新的PaymentTypeKey存在的数据仓库中创建一个临时表。
Select the fact table data from the OLTP view and insert the necessary data in the staging table in the data warehouse.
从OLTP视图中选择事实表数据，然后在数据仓库的登台表中插入必要的数据。
Update the fact table to feed the new column using the staging table as a reference.
更新事实表以使用登台表作为参考来提供新列。
Move on to the old data flow.
转到旧的数据流。

Steps 2 to 4 are never run if the column is already fed.

如果已经送入色谱柱，则从不执行步骤2至4。

To set up this type of solution follow the steps below:

要设置这种解决方案，请按照以下步骤操作：

Create the Execute SQL task for the test.

创建测试的执行SQL任务。

This task should run a check query like this:

此任务应运行如下检查查询：
```
 
IF EXISTS (SELECT
	1
FROM FactResellerSales
WHERE PaymentTypeKey <> '-1') 
SELECT
	1 AS ColumnIsPopulated 
ELSE 
SELECT
	0 AS ColumnIsPopulated
            
```
This gives you a Boolean value which you can assign to a package variable on which you can direct the flow of the package. The result set is set up as follows:

这为您提供了一个布尔值，您可以将其分配给包变量，在该变量上可以指导包的流程。结果集设置如下：
Once you are done, drag in another Execute SQL task that will be used for the creation of the staging table. However before you start that, drag an output to the new task and double click on the arrow to open its properties. From here you can change the “Evaluation operation” from “Constraint” to “Expression” and use the following expression (don’t forget to change the “Multiple constraints” value to a “Logical OR” which implies the ETL will take either this output or the other):

完成后，拖入另一个将用于创建登台表的Execute SQL任务。但是，在开始之前，请将输出拖动到新任务，然后双击箭头以打开其属性。在这里，您可以将“评估运算”从“约束”更改为“表达式”，并使用以下表达式（不要忘记将“多个约束”值更改为“逻辑或”，这意味着ETL将采用这两种方法）输出或其他）：
You can now configure the Execute SQL task to create the staging table:

现在，您可以配置Execute SQL任务来创建登台表：

I like to use an IF NOT EXISTS statement to ensure the script is only run if the table does not exist. This is just to ensure the ETL can be stopped more easily at any moment and rerun without errors. Here is the script:

我喜欢使用IF NOT EXISTS语句来确保仅在表不存在时才运行脚本。这只是为了确保ETL随时可以更轻松地停止并重新运行而不会出错。这是脚本：
```
 
IF NOT EXISTS (SELECT
	1
FROM sys.tables
WHERE name = 'TempETLFactResellerSales') 
CREATE TABLE TempETLFactResellerSales(SalesOrderNumber varchar(20), SalesOrderLineNumber tinyint, PaymentTypeKey int)
        
```
The next task is a Data Flow Task. The source selects only the key fact primary key and the new column key from a view on the OLTP database:

下一个任务是数据流任务。源仅从OLTP数据库上的视图中选择关键事实主键和新列键：
```
 
SELECT
	SalesOrderNumber,
	SalesOrderLineNumber,
	PaymentTypeID
FROM ViewFactResellerSales
WHERE PaymentTypeID IS NOT NULL
        
```
The lookup task find the PaymentTypeKey from the PaymentTypeID (these are not always the same as surrogate keys are often preferred in denormalized data warehouses). The Lookup task is setup to select the PaymentTypeKey and PaymentTypeID from DimPaymentType in the “Connection” screen and join on PaymentTypeID on the “Columns” screen as follows:

查找任务从PaymentTypeID中找到PaymentTypeKey（它们并不总是相同，因为在非规范化数据仓库中通常首选替代密钥）。查找任务已设置为从“连接”屏幕中的DimPaymentType中选择PaymentTypeKey和PaymentTypeID，并在“列”屏幕上加入PaymentTypeID，如下所示：

The last step is an OLE DB Source that inserts that lookup output into the staging output with the following mapping:

最后一步是一个OLE DB源，它使用以下映射将查找输出插入到临时输出中：

N.B. You must first run the CREATE script of the temporary staging table to be able to set up this mapping in the UI. Also, in order to run the package it is important to change the “ValidateExternalMetadata” value of the OLE DB Source task to “False”. This allows the package to run even when the table does not yet exists (this is normal as it is created in the previous step)

注意：必须首先运行临时登台表的CREATE脚本，才能在UI中设置此映射。另外，为了运行程序包，将OLE DB Source任务的“ ValidateExternalMetadata”值更改为“ False”很重要。即使表尚不存在，这也允许程序包运行（这是在上一步中创建的正常状态）
The final task to create is another Execute SQL task that will run the update of the fact table using the staging table as a reference. It is a simple Execute SQL task on the data warehouse that runs the following query:

创建的最后一个任务是另一个Execute SQL任务，它将使用登台表作为参考来运行事实表的更新。这是运行以下查询的数据仓库上的简单Execute SQL任务：
```
 
UPDATE Fact
SET Fact.PaymentTypeKey = Staging.PaymentTypeKey
FROM FactResellerSales AS Fact
JOIN TempETLFactResellerSales AS Staging
	ON Fact.SalesOrderNumber = Staging.SalesOrderNumber
	AND Fact.SalesOrderLineNumber = Staging.SalesOrderLineNumber
GO
 
DROP TABLE TempETLFactResellerSales
GO
        
```
After this query runs your fact table should have its new column you should now have a dataflow that looks something similar to the screenshot to the left. You will have to join up to the old Data Flow Task to finish up.

运行此查询后，事实表应具有其新列，您现在应具有一个看起来类似于左侧屏幕快照的数据流。您将必须加入旧的数据流任务才能完成。

To do so, drag the output of the “Update Fact Table” task to the “Original Data Flow Task”.

为此，请将“更新事实表”任务的输出拖动到“原始数据流任务”。

Next you need to click on the conditional “Check if column is seeded” task and drag a second output directly to the “Original Data Flow Task” and edit the properties to test the “ColumnIsSeeded” variable and direct to the old Data Flow Task when it’s value is greater than zero as below.

接下来，您需要单击条件“检查列是否已播种”任务，并将第二个输出直接拖动到“原始数据流任务”，然后编辑属性以测试“ ColumnIsSeeded”变量，并在出现以下情况时直接定向到旧的数据流任务它的值大于零，如下所示。

If you do this correctly the outputs of both tasks above the “Original Data Flow Task” should become dotted, this signifies that the flow of the package will follow either one path or the other.

如果正确执行此操作，则“原始数据流任务”上方的两个任务的输出都应变为虚线，这表示包的流将遵循一条路径或另一条路径。

Note, if your old package has multiple starting tasks you can envelope them in a “Sequence Container” to kick off the entire sequence instead of merely the “Original Data Flow Task”

请注意，如果您的旧程序包有多个启动任务，则可以将它们封装在“序列容器”中以启动整个序列，而不仅仅是“原始数据流任务”

结论 (Conclusion)

This solution provides a way of safely updating a new column that has been added to an existing fact table. The conditional check allows the flow of the package to update column where there is existing data or just move to the old data flow task if the data has already been populated. With that in mind, the new branch of data flow will logically only run once. Once you have successfully run the ETL you can delete or disable the steps and just run the “Original Data Flow Task” as you did before.

该解决方案提供了一种安全地更新已添加到现有事实表中的新列的方法。有条件的检查允许包的流更新存在数据的列，或者如果已经填充了数据，则仅移至旧的数据流任务。考虑到这一点，新的数据流分支在逻辑上将只运行一次。成功运行ETL后，您可以删除或禁用这些步骤，然后像以前一样运行“原始数据流任务”。

有用的资源： ( Useful resources:)

Designing Star Schema
Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals
Extending the Data Warehouse Database

设计星型架构
 数据仓库基础知识：面向IT专业人员的综合指南
 扩展数据仓库数据库

翻译自: https://www.sqlshack.com/sql-server-business-intelligence-expanding-fact-tables-established-data-warehouse/

标签

#数据库 #java #python #mysql #大数据