Wednesday, October 30, 2013

Pentaho Data Integration ( PDI ) - Generic Design Guidelines

Design for Failure Handling


Recommended to ensure that the data source is available before a process is kicked off. One basic design principle is that the ETL job needs to be able to fail gracefully when a data availability test fails.

Kettle contains following features to do this.

  • Test a repository connection.
  • Ping a host to check whether it's available.
  • Wait for a SQL command to return success/failure based on a row count condition.
  • Check for empty folders.
  • Check for the existence of a file, table, or column.
  • Compare files or folders.
  • Set a timeout on FTP and SSH connections.
  • Create failure/success outputs on every available job step.

Multi Tier Loading Strategies

For ETL processes it is recommended to employ a pre-load staging area. This provides the ability to pull data from the production system and use it for data cleaning and harmonization activities without interfering with the production systems. By leveraging this type of strategy we will be able to see real production data sooner and follow the guiding principle of ‘Convert Early, Convert Often, and with Real Production Data'.


Table Driven

Developers frequently need to perform a large amount of cross-referencing, hard-coding of values, or other repeatable data transformations during data migration. These transformations often have a probability to change over time. Without a table driven approach this will cause code changes, bug fixes, re-testing, and re-deployments during the development effort. This work is unnecessary on many occasions and could be avoided with the use of reference tables. It is recommended to use table driven approaches such as these whenever possible. 

Some common table driven approaches include:

  • Default Values
Hard-coded values for a given column, stored in a table where the values could be changed whenever a requirement changes. For example, a hard coded value of NA for any value not populated and then want to change that value to NV we could simply change the value in a default value table rather than change numerous hard-coded values. This should be determined for all lookup steps and stream lookup steps.

  • Cross-Reference Values
Frequently in data migration projects there is a need to take values from the source system and convert them to the value of the target system. These values are usually identified up-front, but as the source system changes additional values are also needed. In a typical development situation this would require adding additional values to a series of IF statements. With a table driven situation, new data could be added to a cross-reference table and no coding, testing, or deployment would be required.

  • Code-Driven Table
In some instances a set of understood rules are known. By taking those rules and building code against them, a table-driven/code solution can be very productive. For example, if we had a rules table that was keyed by table/column/rule id, then whenever that combination was found a pre-set piece of code would be executed. If at a later date the rules change to a different set of pre-determined rules, the rule table could change for the column and no additional coding would be required.


Reusability


Re-use should always be considered during Pentaho ETL development. There is often tremendous opportunity for re-use of jobs/transformations/processes/scripts/testing documents. This reduces the staff time and lowers project costs.

Being able to reuse existing parts of ETL solution is an indispensable PDI feature. Easy ways of doing this is to copy and paste or duplicate existing transformation steps, but that's not really reuse. The term reuse refers to the capability to define a step or transformation once and call the same component from different places. Within Kettle this is achieved by the Mapping step, which lets reuse existing transformations over and over as subcomponents in other transformations. Transformations themselves can be used multiple times in multiple jobs, and the same applies to jobs which can be reused as sub jobs in other jobs as well.


Common Error Handling

For any ETL solution, it is critical to find data exceptions and write appropriate rules to correct these situations throughout the data migration effort. It is highly recommended to build a common method for capturing and recording these exceptions. This common method should then be deployed for all data migration processes.


Simplify

For data migration projects it is possible to build one process to pull all data for a given entity from all systems to the target system. While this may seem ideal, these types of complex processes take much longer to design and develop, are challenging to test, and are very difficult to maintain over time. Due to these drawbacks, it is recommended developing many simple processes as needed to complete the effort rather than a few complex processes.

Pentaho Data Integration - PDI Installation for Windows


Download latest PDI version from http://pentaho.com/download.
Choose 32 Bit or 64 Bit based on OS requirements.




Step 2 :

PDI 4.2.1 version used for this demo purpose.
Click on pdi-4.2.1-GA-i386.exe and accept the license agreements.








Step 3 :

PDI enterprise edition set up will pop up. Please click next.


















Step 4 :
Set up Administration account for Pentaho Enterprise Console.
Enter password. Note down the password for later reference.


















Step 5 :

Set up will begin installing Pentaho applications.
Evaluation licenses will be auto generated and installed for trial versions.



Step 6 :

Installation Complete - Enterprise Console and Spoon can be launched for configuration.

Pentaho Repository Queries


User Info

SELECT LOGIN, NAME, DESCRIPTION, ENABLED FROM R_USER

Job Info

 SELECT NAME, DESCRIPTION, JOB_VERSION, JOB_STATUS, CREATED_USER, CREATED_DATE, MODIFIED_USER, MODIFIED_DATE FROM R_JOB
Transformation Info

 SELECT NAME, DESCRIPTION, TRANS_VERSION, TRANS_STATUS, CREATED_USER, CREATED_DATE, MODIFIED_USER, MODIFIED_DATE FROM R_TRANSFORMATION

Step Info

SELECT S.NAME STEP_NAME, S.DESCRIPTION, ST.DESCRIPTION AS STEP_TYPE, T.NAME AS TRANS_NAME FROM R_STEP S, R_STEP_TYPE ST, R_TRANSFORMATION T WHERE S.ID_STEP_TYPE = ST.ID_STEP_TYPE AND S.ID_TRANSFORMATION = T.ID_TRANSFORMATION Job Entry Info SELECT E.NAME JOB_ENTRY_NAME, E.DESCRIPTION, JT.DESCRIPTION AS ENTRY_TYPE, J.NAME AS JOB_NAME FROM R_JOBENTRY E, R_JOBENTRY_TYPE JT, R_JOB J WHERE E.ID_JOBENTRY_TYPE = JT.ID_JOBENTRY_TYPE AND E.ID_JOB = J.ID_JOB

DB Connection Info

SELECT T.DESCRIPTION DATABASE_TYPE, D.NAME NAME, D.HOST_NAME, D.DATABASE_NAME, D.PORT, D.USERNAME FROM R_DATABASE D, R_DATABASE_TYPE T WHERE D.ID_DATABASE_TYPE = T.ID_DATABASE_TYPE

 Dependencies Check for all transformations using given table, say PRODUCT_MASTER

SELECT S.NAME STEP_NAME, S.DESCRIPTION, ST.DESCRIPTION AS STEP_TYPE, T.NAME AS TRANS_NAME, SA.VALUE_STR FROM R_STEP S, R_STEP_TYPE ST, R_TRANSFORMATION T, R_STEP_ATTRIBUTE SA WHERE S.ID_STEP_TYPE = ST.ID_STEP_TYPE AND S.ID_TRANSFORMATION = T.ID_TRANSFORMATION AND SA.ID_TRANSFORMATION = T.ID_TRANSFORMATION AND SA.ID_STEP = S.ID_STEP AND SA.CODE = 'sql' and SA.VALUE_STR like '%PRODUCT_MASTER%'