SSIS (SQL Server Integration Services) is an upgrade of DTS (Data Transformation Services), which is a feature of the previous version of SQL Server. SSIS packages can be created in BIDS (Business Intelligence Development Studio). These can be used to merge data from heterogeneous data sources into SQL Server. In fact, they can also be used to populate data warehouses, to clean and standardize data, and to automate administrative tasks.
Generally, integration services provide a platform to build data integration and workflow applications. The primary use for SSIS is data warehousing providing a fast and flexible tool for data extraction, transformation, and loading (ETL). The tight integration with other Microsoft SQL family and the rich studio environments adds more reasons to adopt such a solution. Furthermore, it can also automate maintenance of SQL Server databases, update multidimensional cube data, robust event and error handling and perform other functions.
The data transformation and workflow creation of SSIS can be summarized in 3 major components:
- Operational Data: These are databases where its data is mostly used in the current operations and exist in many formats and technologies such as MySQL, PostgreSQL, Oracle Database, IBM DB2, SAP ASE, Azure SQL or any other. The term OLTP (Online transaction processing) databases are often used interchangeability in the domain of data engineering and warehousing. As operational databases support the transaction-oriented applications consist. Additionally, these databases are stored in separate locations with different versions which further complicates the integration part. Thus, the transfer to the data warehouse for better analysis and longer-term storage becomes more feasible.
- ETL process: It is the actual process of heavy lifting to first extract the data from various sources(homogeneous or heterogeneous). Then to transform (cleaning, formatting, coding, aggregating, filtering, ordering & validating) it to meet your requirement needs. Finally, load into a target data warehouse(s) for further and easier analysis with minimal resources.
- Data Warehouse: The home to a large set of data accumulated after proper transformations as the output of the ETL process. Its main purpose is to help in quickly and effectively answering business questions for more accurate decision making. The term OLAP (Online Analytical Processing) database is often referred to on the data-warehouse side.
In this tutorial, we will be learning how to integrate data found in flat text file -which can be generalized to any source data type as will be seen later- into an already created warehouse. The creation of the warehouse is out of the scope of this post.
We need the following list to be installed:
- Microsoft SQL Server
- Microsoft SQL Server Management Studio (SSMS)
- Microsoft SQL Server Data Tools (SDT)
- Sample warehouse database AdventureWorksDW2012
OLAP Data Warehouse Installation
After downloading the zip file “AdventureWorksDW-data-warehouse-install-script.zip“:
- Unzip the file to a path of your choice
- Open SSMS and connect to the database server you already setup while installing
- Click new query and copy to the contents of instawdb.sql file found in unzipped data in step 1. Make sure before you execute the query to the menu in SSMS and select SQLCMD mode in order to correctly run the script.
- Click execute query where it will create the database with name “AdventureWorksDW“, create all tables prefixed with Dim (short for dimension which is a collection of reference information about a measurable event usually collected from different sources to form cubes). Also, some sample data will be inserted in the tables as well.
- In SSMS, right-click Databases and click refresh to inspect if your database is installed correctly or not.
In case you get this error “The code page 65001 is not supported by the server. (Microsoft SQL Server, Error: 2775)“ while inserting the bulk data to the tables from the included csv files. Go to control panel -> region settings -> Language for non-unicode programs -> System Locale and un-check the box Beta:Use Unicode UTF-8 for worlwide Language support. This should resolve the issue.
In this tutorial, we will learn how to create a simple ETL package that does the following:
- Extracts data from a single flat file source.
- Transforms the data using two lookup transformation components
- Writes that data to the FactCurrencyRate fact table in AdventureWorksDW warehouse database
Here is the sample currency flat file.
The file has four columns: the average rate of the currency, a currency key, a date key, and the end-of-day rate. When working with flat file source data, it’s important to understand how the Flat File connection manager interprets the flat file data. If the flat file source is Unicode, the Flat File connection manager defines all columns as [DT_WSTR] with a default column width of 50. If the flat file source is ANSI-encoded, the columns are defined as [DT_STR] with a default column width of 50. Note, you probably need to change these defaults to make the string column types more applicable for your data. You need to look at the data type of the destination, and then choose that type within the Flat File connection manager.
The destination is the FactCurrencyRate fact table in AdventureWorksDW. The FactCurrencyRate fact table has four columns AverageRate (float), CurrencyKey(int (FK to DimCurrency)),DateKey (int (FK to DimDate) and EndOfDayRate (float).
On the other hand, the destination data formats indicate that lookups are necessary for CurrencyKey and DateKey values. The transformations that perform these lookups get those values by using the alternate keys from the DimCurrency and DimDate dimension tables.
|Flat file Column||Table Name||Column Name||Data Type|
1-Create a new Integration Services project
- On the Start menu, point to All Programs, point to Visual Studio (SSDT).
- On the File menu, point to New, and click Project to create a new Integration Services project and give it a name of your choice.
- By default, an empty package, titled Package.dtsx, will be created and added to your project.
2-Add and configure a Flat File connection manager
- Right-click anywhere in the Connection Managers area, and then click New Flat File Connection
- In the Flat File Connection Manager Editor dialog box, for Connection manager name, type Sample Flat File Source Data and browse to the “SampleCurrencyData.txt”
- From the advanced tab, rename columns as Column 0 to AverageRate, Column 1 name property to CurrencyID, Column 2 to CurrencyDate and Column 3 to EndOfDayRate.
- Next, we need to change the suggested data types by the connection manager to types that match the destination table as explained before. Change AverageRate and EndOfDayRate to DT_R4, CurrencyID to Unicode string [DT_STR] and CurrencyDate to timestamp [DT_DBDATE].
3-Add and configure an OLE DB connection manager
After adding the source, we need to task is to add an OLE DB connection manager to connect to the destination.
- Right-click anywhere in the Connection Managers area, and then click New OLE DB Connection.
- In the Configure OLE DB Connection Manager dialog box, click New.
- For Server name, enter your server name the one you already setup before.
- In the select database box choose “AdventureWorksDW” and then press Test to make sure the connection is appropriately setup.
- From properties, rename the connection to “Destination DB”
4-Add a Data Flow task to the package
After creating the connection managers for the source and destination data, the next step is to add a Data Flow task to the package. The Data Flow task encapsulates the data flow engine that moves data between sources and destinations and provides the functionality for transforming, cleaning, and modifying data as it is moved.
- Click the Control Flow tab.
- In the Toolbox, expand Control Flow Items, and drag a Data Flow Task onto the design surface of the Control Flow tab.
- On the Control Flow design surface, right-click the newly added Data Flow Task, click Rename, and change the name to Extract Sample Currency Data.
5-Add and configure the flat file source
In this step, you will add and configure a Flat File source to your package. A Flat File source is a data flow component that uses metadata defined by a Flat File connection manager to specify the format and structure of the data to be extracted from the flat file by a transform process.
- Double-clicking the Extract Sample Currency Data data flow task
- In the Toolbox, expand Data FlowSources, and then drag a Flat File Source onto the design surface of the Data Flow tab.
- On the Data Flow design surface, right-click the newly added Flat File Source, click Rename, and change the name to Extract Sample Currency Data
- Double-click the Flat File source to open the Flat File Source Editor dialog box
- In the Flat file connection manager box, type or select Sample Flat File Source Data
- Click Columns and verify that the names of the columns are correct
6-Add and configure the lookup transformations
The next step is the transformation part of the ETL process as explained previously. Now we need to define the Lookup transformations needed to obtain the values for the CurrencyKey and TimeKey. The Lookup transformation performs the lookup by joining data in the specified input column to a column in a reference dataset. The reference dataset can take many forms such as an existing table, view, a new table, or the result of an SQL statement. In our tutorial, we will form 2 transformations:
a) To lookup of values from the CurrencyKey column of the DimCurrency dimension table based on matching CurrencyID column values from the flat file.
- In the Toolbox, expand Data Flow Transformations, and then drag Lookup onto the design surface of the Data Flow tab. Place Lookup directly below the Extract Sample Currency Data source.
- Click the Extract Sample Currency Data flat file source and drag the green arrow onto the newly added Lookup transformation to connect the two components.
- On the Data Flow design surface, click Lookup in the Lookup transformation, and change the name to Lookup Currency Key.
- Double-click the Lookup CurrencyKey transformation.
- On the General page, make the following selections:
- Select Full cache.
- In the Connection type area, select OLE DB connection manager.
- On the Connection page, make the following selections:
- In the OLE DB connection manager dialog box, ensure that AdventureWorksDW is displayed.
- Select Use results of an SQL query, and then type or copy the following SQL statement:
select * from (select * from [dbo].[DimCurrency]) as refTable
where [refTable].[CurrencyAlternateKey] = 'ARS'
OR [refTable].[CurrencyAlternateKey] = 'AUD' OR [refTable].[CurrencyAlternateKey] = 'BRL' OR [refTable].[CurrencyAlternateKey] = 'CAD' OR [refTable].[CurrencyAlternateKey] = 'CNY' OR [refTable].[CurrencyAlternateKey] = 'DEM' OR [refTable].[CurrencyAlternateKey] = 'EUR' OR [refTable].[CurrencyAlternateKey] = 'FRF' OR [refTable].[CurrencyAlternateKey] = 'GBP' OR [refTable].[CurrencyAlternateKey] = 'JPY' OR [refTable].[CurrencyAlternateKey] = 'MXN' OR [refTable].[CurrencyAlternateKey] = 'SAR' OR [refTable].[CurrencyAlternateKey] = 'USD' OR [refTable].[CurrencyAlternateKey] = 'VE
- On the Columns page, make the following selections
- In the Available Input Columns panel, drag CurrencyID to the Available Lookup Columns panel and drop it on CurrencyAlternateKey.
- In the Available Lookup Columns list, select the checkbox to the right of CurrencyKey then click OK to return to the Data Flow design surface.
b) To lookup of values from the TimeKey column of the DimTime dimension table based on matching CurrencyDate column values from the flat file.
- In the Toolbox, drag Lookup onto the Data Flow design surface. Place Lookup directly below the Lookup CurrencyKey transformation.
- Click the Lookup Currency Key transformation and drag the green arrow onto the newly added Lookup transformation to connect the two components.
- In the Input Output Selection dialog box, click Lookup Match Output in the Output list box, and then click OK.
- On the Data Flow design surface, click Lookup in the newly added Lookup transformation, and change the name to Lookup Date Key.
- Double-click the Lookup Date Key transformation.
- On the General page, select Partial cache.
- On the Connection page, make the following selections:
- In the OLEDB connection manager dialog box, ensure that AdventureWorksDW is displayed.
- In the Use a table or view box, type or select [dbo].[DimDate].
- On the Columns page, make the following selections:
- In the Available Input Columns panel, drag CurrencyDate to the Available Lookup Columns panel and drop it on FullDateAlternateKey.
- In the Available Lookup Columns list, select the check box to the right of TimeKey.
- On the Advanced page, review the caching options.
- Click OK to return to the Data Flow design surface
7-Add and configure the OLE DB destination
The package now can extract data from the flat file source and transform it into a format that is compatible with the destination. The next step is to actually load the transformed data into the destination. To load the data, you must add an OLE DB destination to the data flow.
- In the Toolbox, expand Data Flow Destinations, and drag OLE DB Destination onto the design surface of the Data Flow tab. Place the OLE DB destination directly below the Lookup DateKey transformation.
- Click the Lookup Date Key transformation and drag the green arrow over to the newly added OLE DB Destination to connect the two components together.
- In the Input-Output Selection dialog box, in the Output list box, click Lookup Match Output, and then click OK.
- On the Data Flow design surface, click OLE DB Destination in the newly added OLE DB Destination component, and change the name to Sample OLE DB Destination.
- Double-click Sample OLE DB Destination.
- In the OLE DB Destination Editor dialog box, ensure that AdventureWorksDW is selected in the OLE DB Connection manager box.
- In the Name of the table or the view box, type or select [dbo].[FactCurrencyRate].
- Click Mappings.
- Verify that the AverageRate, CurrencyKey, EndOfDayRate, and TimeKey input columns are mapped correctly to the destination columns. If same-named columns are mapped, the mapping is correct. Then click OK.
- Now the pipeline should look something like that
- The last step is to hit the start button from the visual studio to run the full pipeline.
This concludes the tutorial for this post. However, this is just a very simple integration tutorial and there are much more to learn to fully cover SSIS services. For more insights, below, we give a summarized list to some of the most essential topics to cover as a follow-up to this tutorial.
Additional Learning Resources:
- Looping: It is one of the major features utilized by data engineers to iteratively perform recurrent tasks on multiple objects. You can refer to this link for more information.
- Package Configurations: Take advantage of controlling your package by adding variables and editing package configuration options. For more information, you can refer to this link.
- Logging: You need to learn more about keeping track and monitor the execution of your packages through the addition of logging to your package.
- Error Flow: You need to have a solid understanding of how to handle the errors that may occur throughout your transformation process. This includes error output handling configurations and redirection.