- ETL with Azure Cookbook
- Christian Cote Matija Lah Madina Saitakhmetova
- 1471字
- 2021-06-11 18:02:40
Creating a simple ETL package
This recipe will show you how to build a simple ETL package that does the following:
- Extract data from the AdventureWorksLT SalesLT.Customer table.
- Transform data: Concatenating all customer-name-related information into a single column called FullName.
- Load the data into a new table called SalesLT.CustomerFullName.
The table we create will be used in the final recipe in this chapter, Loading data before its transformation, to explain ELT.
Getting ready
This recipe assumes that you have installed SSMS, Visual Studio 2019, and SSIS.
How to do it…
Let's begin by opening Visual Studio 2019:
- Select Create a new project from the main screen.
- Select Integration Services Project from the screen that appears, as shown in the following screenshot:
- As shown in the following screenshot, name the project AdventureWorksETL and select a folder for it. Click Create to complete the project creation:
- The project is created and opened, along with a package named Package.dtsx. Right-click in the Connection Managers section at the bottom of the package and select New OLE DB Connection… as shown in the following screenshot:
- Click on New…. Enter the database name, select SQL Server Authentication, and type the username and password. Select AdventureWorksLT as the database name and click on OK, as shown in the following screenshot:
- Back on the connection manager configuration window, click on OK to complete the connection manager creation, as shown in the following screenshot:
- Click on the F2 function key on your keyboard to rename the connection manager to cmgr_etlinazurecookbook.database.windows.net.AdventureWorksLT.ETLAdmin. We simply prefix the name with cmgr_. This is a standard naming convention in SSIS that allows package creation standards.
- Right-click on the connection manager and select Convert to Project Connection from the contextual menu, as shown in the following screenshot:
Having a project connection manager makes the connection available on every package we create from now on. The connection manager will simply be present on all the packages in the project so we will not need to add a new connection, repeating the preceding steps, every time we add a new package to the project.
- From the SSIS toolbox on the left, drag and drop a Data Flow Task onto the package surface. Name it DFT_SalesLT_CustomerFullName. This is a naming convention that indicates this dataflow is processing data for SalesLT.CustomerFullName.
- Double-click on the Data Flow Task and add an OleDB data source from the Other sources section of the SSIS toolbox. Call it OLEDBSRC_SalesLT_Customer. This indicates that the source we are using is an OLEDB source, and extracts data from the SalesLT.Customer table.
- Double-click on the OLEDB connection manager and set it to the one we created before. Set Data access mode to SQL Command. Then type the following SQL statement inside the SQL command textbox:
SELECT [CustomerID]
,[NameStyle]
,[Title]
,[FirstName]
,[MiddleName]
,[LastName]
,[Suffix]
FROM [SalesLT].[Customer]
Click on Preview to ensure that the query works correctly. Click on OK to close the OLE DB source component.
- Drag and drop a Derived Column transform onto the Data Flow and attach it to the OLE DB source created earlier. Rename it DER_FullName, indicating that this is adding a new column called FullName to the pipeline.
- Double-click on it to open it. In the Derived column name field, type FullName. In the Expression field, type the following expression and click OK:
TRIM(REPLACENULL(Title,"") + " " + REPLACENULL(FirstName,"") + " " + TRIM(REPLACENULL(MiddleName,"") + " ") + REPLACENULL(LastName,"") + " " + REPLACENULL(Suffix,""))
- From Other Destinations in the SSIS toolbox, add an OLE DB destination to the Data Flow and attach it to the DER_FullName derived column transform created in Step 13 of this recipe. Rename it OLEDBDST_SalesLT_CustomerFullName to indicate that we're loading data into the SalesLT.CustomerFullName table.
- Double-click on it to open it. Make sure the OLEDB connection manager is set to the one we created before and Table access mode is set to Table or view – fast load.
- Click on the New… button beside Name of the table or the view and set the properties as shown in the following screenshot:
- Make sure to use the table name [SalesLT].[CustomerFullName]. Failure to do so will fail other steps of the recipe. The SQL DDL script to create the table is the following:
CREATE TABLE [SalesLT].[CustomerFullName] (
[CustomerID] int,
[NameStyle] bit,
[Title] nvarchar(8),
[FirstName] nvarchar(50),
[MiddleName] nvarchar(50),
[LastName] nvarchar(50),
[Suffix] nvarchar(10),
[FullName] nvarchar(350)
)
- Click on OK to close the Create Table window and return to the previous screen. Click on Mappings on the left of OLE DB Destination Editor. Your screen should look like the following screenshot. Click on OK to close the editor and return to the Data Flow Task:
- To correctly align the Data Flow content, right-click on the taskbar and select Layout from the submenu as shown in the following screenshot:
- Now, select all items in the Data Flow task (Ctrl + A) and click on the Make same width icon on the Layout toolbar.
- For the final finishing touch, we'll auto-align all transforms on the Data Flow. From the Format menu, select Auto Layout and Diagram as shown in the following screenshot:
The Data Flow should now look like the following screenshot:
- Right-click on the arrow (path) between the DER_FullName transform and OLEDBDST and select Enable Data Viewer as shown in the following screenshot:
- Close the Data Viewer. Now, right-click anywhere on the Data Flow task background and select Execute Task as shown in the following screenshot:
The Data Flow Task should execute successfully, and you should see the Data Viewer with the new column added as shown in the following screenshot:
- Now, right-click on the path between the derived column and destination again to Disable Data Viewer as we don't need it anymore. Stop the execution of the Data Flow by clicking the Stop button.
- As a final step, we'll make our package idempotent, that is, making sure that anytime we execute it, the destination data will be the same.
Go back to the package's control flow and add an Execute T-SQL task from the Favorite Tasks section of the SSIS toolbox. Double-click on it to open it. Set the properties as follows:
a) Name: SQL_Truncate_SalesLT_CustomerFullName
b) Connection: the connection manager of the package
c) SQLSourceType: Direct Input
d) SQL Statement: TRUNCATE TABLE SalesLT.CustomerFullName
Click on OK to close Execute SQL Task Editor. Back on the control flow, attach the Data Flow task to the Execute SQL Task just created.
- Like we did for the Dataflow items, select both tasks and, from the Layout toolbar, click on Make same width. From the Format menu, select, Format > Auto Layout > Diagram. Your package should look like the following screenshot:
- Save the package, right-click on the package in the Solution Explorer (at the right of Visual Studio), and select Rename from the submenu. Rename the package to SalesLT.CustomerFullName.dtsx.
- Finally, execute the package. Again, right-click on it and select Execute package from the submenu that appears. The package will execute, and your screen should look like the following screenshot:
How it works…
This recipe showed a simple ETL design pattern. Most ETL packages are more complex but they usually follow the same pattern: Source > Transformation > Load into a destination. The transformation phase is quite simple: it uses an expression to concatenate the name-related columns, taking care of the fact that the column may be NULL. The outer TRIM command takes care of the space at the front and end of the column because the Title and Suffix columns might be NULL. It is the same for the inner TRIM column that is taking care of the NULL MiddleName column.