Monday, November 4, 2013

Getting Connection information on the stream

Sometimes I need access to the database connection information inside a PDI transformation. I can use a JavaScript scripting step to fetch the metadata for all database connections and put it on the stream. Since the JavaScript step is not an input step, I'll need a Generate Rows step to act as a trigger. It will have a Limit of 1 and no fields:




The JavaScript step looks like the following:



The code is on PasteBin here.

NOTE: Make sure the Compatibility mode checkbox is not checked (see above image).

This can be extended to fetch other information such as username, tablespace, etc.  Alternatively, you can hop to Roland's transformation to grab all JDBC metadata.

Wednesday, October 30, 2013

Pentaho Data Integration ( PDI ) - Generic Design Guidelines

Design for Failure Handling


It is recommended to ensure that the data source is available before a process is kicked off. One basic design principle is that the ETL job needs to be able to fail gracefully when a data availability test fails.

Kettle contains the following features to support this:

  • Test a repository connection.
  • Ping a host to check whether it's available.
  • Wait for a SQL command to return success/failure based on a row count condition (see the sketch after this list).
  • Check for empty folders.
  • Check for the existence of a file, table, or column.
  • Compare files or folders.
  • Set a timeout on FTP and SSH connections.
  • Create failure/success outputs on every available job step.
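
For example, the row-count style of availability check boils down to running a query like the one below and evaluating the count it returns; the table name and date filter are hypothetical placeholders.

-- Hypothetical availability check: succeed only if today's extract has arrived
SELECT COUNT(*) AS ROW_CNT
FROM STG_ORDERS
WHERE LOAD_DATE = CURRENT_DATE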

Multi Tier Loading Strategies

For ETL processes it is recommended to employ a pre-load staging area. This provides the ability to pull data from the production system and use it for data cleaning and harmonization activities without interfering with the production systems. By leveraging this type of strategy we will be able to see real production data sooner and follow the guiding principle of 'Convert Early, Convert Often, and with Real Production Data'.
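
A pre-load staging table is typically little more than a copy of the source structure plus a few audit columns; the sketch below is a hypothetical example, not a prescribed layout.

-- Hypothetical pre-load staging table: source columns copied as-is, plus audit columns
CREATE TABLE STG_CUSTOMER (
  CUSTOMER_ID   VARCHAR(50),
  CUSTOMER_NAME VARCHAR(200),
  STATUS_CODE   VARCHAR(10),
  SOURCE_SYSTEM VARCHAR(30),  -- which production system the row came from
  LOAD_DATE     TIMESTAMP     -- when the row was pulled from production
)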


Table Driven

Developers frequently need to perform a large amount of cross-referencing, hard-coding of values, or other repeatable data transformations during data migration. These transformations are likely to change over time. Without a table driven approach this causes code changes, bug fixes, re-testing, and re-deployments during the development effort. This work is often unnecessary and can be avoided with the use of reference tables. It is recommended to use table driven approaches such as these whenever possible.

Some common table driven approaches include:

  • Default Values
Hard-coded values for a given column, stored in a table where the values can be changed whenever a requirement changes. For example, if a hard-coded value of NA is used for any value not populated and we later want to change that value to NV, we can simply change the value in a default value table rather than change numerous hard-coded values. This should be considered for all database lookup steps and stream lookup steps (see the sketch after this list).

  • Cross-Reference Values
Frequently in data migration projects there is a need to take values from the source system and convert them to the value of the target system. These values are usually identified up-front, but as the source system changes additional values are also needed. In a typical development situation this would require adding additional values to a series of IF statements. With a table driven situation, new data could be added to a cross-reference table and no coding, testing, or deployment would be required.

  • Code-Driven Table
In some instances a set of well-understood rules is known. By taking those rules and building code against them, a table-driven/code solution can be very productive. For example, if we had a rules table keyed by table/column/rule id, then whenever that combination was found a pre-set piece of code would be executed. If at a later date the rules change to a different set of pre-determined rules, the rule table could be changed for the column and no additional coding would be required.
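
A minimal sketch of the default value and cross-reference patterns described above; every table and column name here is hypothetical and only illustrates the shape of the approach.

-- Hypothetical default value table: one row per target table/column
CREATE TABLE REF_DEFAULT_VALUE (
  TARGET_TABLE  VARCHAR(100),
  TARGET_COLUMN VARCHAR(100),
  DEFAULT_VALUE VARCHAR(100)  -- e.g. 'NA'; change it here instead of in the code
);

-- Hypothetical cross-reference table: maps a source system value to the target value
CREATE TABLE REF_CROSS_REFERENCE (
  SOURCE_SYSTEM VARCHAR(30),
  SOURCE_VALUE  VARCHAR(100),
  TARGET_VALUE  VARCHAR(100)
);

-- Used from a Database lookup / Stream lookup step, or directly in SQL:
SELECT C.CUSTOMER_ID,
       COALESCE(X.TARGET_VALUE, D.DEFAULT_VALUE) AS CUSTOMER_STATUS
FROM STG_CUSTOMER C
LEFT JOIN REF_CROSS_REFERENCE X
       ON X.SOURCE_SYSTEM = C.SOURCE_SYSTEM
      AND X.SOURCE_VALUE = C.STATUS_CODE
LEFT JOIN REF_DEFAULT_VALUE D
       ON D.TARGET_TABLE = 'CUSTOMER'
      AND D.TARGET_COLUMN = 'CUSTOMER_STATUS'

When a new source value appears or the default changes, only the reference data changes; the transformation itself is untouched.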


Reusability


Re-use should always be considered during Pentaho ETL development. There is often tremendous opportunity for re-use of jobs/transformations/processes/scripts/testing documents. This reduces staff time and lowers project costs.

Being able to reuse existing parts of an ETL solution is an indispensable PDI feature. An easy way of doing this is to copy and paste or duplicate existing transformation steps, but that's not really reuse. The term reuse refers to the capability to define a step or transformation once and call the same component from different places. Within Kettle this is achieved with the Mapping step, which lets you reuse existing transformations over and over as subcomponents in other transformations. Transformations themselves can be used multiple times in multiple jobs, and the same applies to jobs, which can be reused as sub-jobs in other jobs as well.


Common Error Handling

For any ETL solution, it is critical to find data exceptions and write appropriate rules to correct these situations throughout the data migration effort. It is highly recommended to build a common method for capturing and recording these exceptions. This common method should then be deployed for all data migration processes.
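
One common way to implement this is a single exception table that every process writes to from its error-handling hops. The layout below is a hypothetical sketch, not a required schema.

-- Hypothetical common exception table, populated from the error-handling hop of each step
CREATE TABLE ETL_EXCEPTION_LOG (
  EXCEPTION_ID  BIGINT,
  JOB_NAME      VARCHAR(200),   -- job or transformation that raised the error
  STEP_NAME     VARCHAR(200),   -- step the failing row was redirected from
  ERROR_CODE    VARCHAR(50),
  ERROR_MESSAGE VARCHAR(2000),
  REJECTED_ROW  VARCHAR(4000),  -- the offending row, serialized for later analysis
  LOGGED_AT     TIMESTAMP
)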


Simplify

For data migration projects it is possible to build one process to pull all data for a given entity from all systems to the target system. While this may seem ideal, these types of complex processes take much longer to design and develop, are challenging to test, and are very difficult to maintain over time. Due to these drawbacks, it is recommended to develop as many simple processes as needed to complete the effort rather than a few complex ones.

Pentaho Data Integration - PDI Installation for Windows


Step 1 :

Download the latest PDI version from http://pentaho.com/download.
Choose 32-bit or 64-bit based on OS requirements.




Step 2 :

PDI version 4.2.1 is used for this demo.
Click on pdi-4.2.1-GA-i386.exe and accept the license agreement.

Step 3 :

The PDI Enterprise Edition setup wizard will pop up. Please click Next.

Step 4 :
Set up the administration account for the Pentaho Enterprise Console.
Enter a password and note it down for later reference.

Step 5 :

Setup will begin installing the Pentaho applications.
Evaluation licenses will be auto-generated and installed for trial versions.



Step 6 :

Installation Complete - Enterprise Console and Spoon can be launched for configuration.

Pentaho Repository Queries


User Info

SELECT LOGIN, NAME, DESCRIPTION, ENABLED FROM R_USER

Job Info

SELECT NAME, DESCRIPTION, JOB_VERSION, JOB_STATUS, CREATED_USER, CREATED_DATE, MODIFIED_USER, MODIFIED_DATE FROM R_JOB

Transformation Info

 SELECT NAME, DESCRIPTION, TRANS_VERSION, TRANS_STATUS, CREATED_USER, CREATED_DATE, MODIFIED_USER, MODIFIED_DATE FROM R_TRANSFORMATION

Step Info

SELECT S.NAME STEP_NAME, S.DESCRIPTION, ST.DESCRIPTION AS STEP_TYPE, T.NAME AS TRANS_NAME FROM R_STEP S, R_STEP_TYPE ST, R_TRANSFORMATION T WHERE S.ID_STEP_TYPE = ST.ID_STEP_TYPE AND S.ID_TRANSFORMATION = T.ID_TRANSFORMATION

Job Entry Info

SELECT E.NAME JOB_ENTRY_NAME, E.DESCRIPTION, JT.DESCRIPTION AS ENTRY_TYPE, J.NAME AS JOB_NAME FROM R_JOBENTRY E, R_JOBENTRY_TYPE JT, R_JOB J WHERE E.ID_JOBENTRY_TYPE = JT.ID_JOBENTRY_TYPE AND E.ID_JOB = J.ID_JOB

DB Connection Info

SELECT T.DESCRIPTION DATABASE_TYPE, D.NAME NAME, D.HOST_NAME, D.DATABASE_NAME, D.PORT, D.USERNAME FROM R_DATABASE D, R_DATABASE_TYPE T WHERE D.ID_DATABASE_TYPE = T.ID_DATABASE_TYPE

Dependencies Check - find all transformations using a given table, say PRODUCT_MASTER

SELECT S.NAME STEP_NAME, S.DESCRIPTION, ST.DESCRIPTION AS STEP_TYPE, T.NAME AS TRANS_NAME, SA.VALUE_STR FROM R_STEP S, R_STEP_TYPE ST, R_TRANSFORMATION T, R_STEP_ATTRIBUTE SA WHERE S.ID_STEP_TYPE = ST.ID_STEP_TYPE AND S.ID_TRANSFORMATION = T.ID_TRANSFORMATION AND SA.ID_TRANSFORMATION = T.ID_TRANSFORMATION AND SA.ID_STEP = S.ID_STEP AND SA.CODE = 'sql' and SA.VALUE_STR like '%PRODUCT_MASTER%'

Monday, August 26, 2013

Database lookup

Description

The Database lookup step allows you to look up values in a database table; for each record on the stream it goes to the database to perform the lookup. Lookup values are added as new fields onto the stream.

Options

The following table describes the available options for configuring the database lookup:
Option Description
Step name Name of the step; this name has to be unique in a single transformation.
Connection Database connection to the lookup table.
Lookup schema Database schema containing the lookup table.
Lookup Table Name of the database table used for the lookup.
Enable cache? Enables caching of database lookups. This means that once a key (or group of keys) has been looked up, the looked-up values are stored and returned again the next time this key (or group of keys) is looked up, without incurring the cost of a database call.
Cache size in rows The size of the cache (number of rows).
Load all data from table Pre-loads the cache with all the data present in the lookup table. This may improve performance by avoiding database calls. However, if you have a large table, you risk running out of memory. IMPORTANT: In-memory look-ups can lead to different results because of differences in the way your database compares data. For example, if your database table allows case-insensitive comparisons, you may get different results with this option. The same goes for data with trailing spaces, which are ignored in certain databases when using certain character data types.
Keys to look up table The keys and conditions to perform the database lookup.
Values to return table The fields from the lookup table to add to the output stream. "New name" allows you to use a different name if the database column name is inappropriate. "Default" is the value returned (instead of null) if the lookup fails. Note that enabling error handling on the Database lookup step will redirect all failed lookup rows to the error handling step, instead of adding them to the main output stream with null/default values. "Type" is the type of the output field.
Do not pass the row if the lookup fails Enable to avoid passing a row when the lookup fails.
Fail on multiple results? Enable to force the step to fail if the lookup returns multiple results.
Order by If the lookup query returns multiple results, the ORDER BY clause helps you to select the record to take. For example, ORDER BY would allow you to pick the customer with the highest sales volume in a specified state.
Get Fields Click to return a list of available fields from the input stream(s) of the step.
Get lookup fields Click to return a list of available fields from the lookup table that can be added to the step's output stream.
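
Conceptually, for each row (or cache miss) the step issues a query of roughly the shape below; the table, key, and return fields are hypothetical and would come from the step's configuration.

-- Hypothetical shape of the lookup query generated from the step settings
SELECT CUSTOMER_NAME, CUSTOMER_SEGMENT   -- "Values to return"
FROM DIM_CUSTOMER                        -- "Lookup Table"
WHERE CUSTOMER_ID = ?                    -- "Keys to look up", bound to a stream field
ORDER BY VALID_FROM DESC                 -- "Order by", used when multiple rows match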

Thursday, June 27, 2013

Value Mapper

The Value Mapper maps string values from one value to another. Mapping is usually solved by storing the conversion table in a database. The Value Mapper provides you with a simple alternative.

Note: You can convert a null field or an empty String value to a non-empty value; to do this, leave the Source value field empty for that mapping row.
Note 2: Null or empty values won't be mapped to the "Default upon non-matching" value; they stay null. Use the tip above if you want to map them anyway.

Options

The following properties are used to define the mappings:
Option Description
Step name Name of the step; this name has to be unique in a single transformation.
Fieldname to use Field to use as the mapping source
Target field name Field to use as the mapping Target
Default upon non-matching Defines a default value for situations where the source value is not empty, but there is no match
Field values table Contains the mapping of source value to converted target value
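
For comparison, the same mapping expressed directly in SQL might look like the CASE expression below; the table, codes, and descriptions are hypothetical.

-- Hypothetical equivalent of a Value Mapper: map source codes to target values,
-- falling back to a default when there is no match
SELECT O.ORDER_ID,
       CASE O.STATUS_CODE
            WHEN 'O' THEN 'Open'
            WHEN 'C' THEN 'Closed'
            ELSE 'Unknown'   -- "Default upon non-matching"
       END AS STATUS_DESC
FROM STG_ORDERS O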

Unique Rows (HashSet)

The Unique Rows (HashSet) transformation step tracks exact duplicate rows. The step can also remove duplicate rows and leave only unique occurrences. Unlike the Unique Rows step, which only detects consecutive duplicates and therefore needs a sorted input to work correctly, the Unique Rows (HashSet) step does not require a sorted input; instead it tracks duplicates in memory.
Option Definition
Step name Name of this step as it appears in the transformation workspace
Compare using stored row values Stores values for the selected fields in memory for every record. Storing row values requires more memory, but it prevents possible false positives if there are hash collisions.
Redirect duplicate row Processes duplicate rows as an error and redirects them to the error stream of the step. Requires you to set error handling for this step.
Error description Sets the error handling description to display when duplicate rows are detected. Only available when Redirect duplicate row is checked.
Fields to compare table Lists the fields to compare; no entries means the step compares the entire row
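
For reference, keeping one occurrence of each duplicate regardless of row order could be expressed in SQL as below; the table and key columns are hypothetical.

-- Hypothetical SQL equivalent: keep the first occurrence of each duplicate,
-- whether or not the duplicates are consecutive
SELECT CUSTOMER_ID, EMAIL, CREATED_DATE
FROM (
    SELECT CUSTOMER_ID, EMAIL, CREATED_DATE,
           ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID, EMAIL ORDER BY CREATED_DATE) AS RN
    FROM STG_CUSTOMER
) T
WHERE RN = 1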