Pentaho master: June 2013

Thursday, June 27, 2013

Value Mapper

The Value Mapper maps string values from one value to another. Mapping is usually solved by storing the conversion table in a database. The Value Mapper provides you with a simple alternative.

Note: You can convert a null field or empty String value to a non-empty value. Leave the Source Value field empty. You can specify one of these empty source field values.
Note 2: Null or empty value won't be mapped to the "Default upon non-matching" value. They stay null. Use the above tip if you want to map them anyway.

Options

The following properties are used to define the mappings:

Option	Description
Step name	Name of the step; this name has to be unique in a single transformation.
Fieldname to use	Field to use as the mapping source
Target field name	Field to use as the mapping Target
Default upon non-matching	Defines a default value for situations where the source value is not empty, but there is no match
Field values table	Contains the mapping of source value to converted target value

The Unique Rows (HashSet) transformation step tracks exact duplicate rows. The step can also remove duplicate rows and leave only unique occurrences. Unlike the Unique Rows transformation step, which only correctly evaluates consecutive duplicate rows unless used with a sorted input, the Unique Rows (HashSet) step does not require a sorted input to process duplicate rows, instead it tracks duplicates in memory.

Option	Definition
Step name	Name of this step as it appears in the transformation workspace
Compare using stored row values	Stores values for the selected fields in memory for every record. Storing row values requires more memory, but it prevents possible false positives if there are hash collisions.
Redirect duplicate row	Processes duplicate rows as an error and redirect rows to the error stream of the step. Requires you to set error handling for this step.
Error description	Sets the error handling description to display when duplicate rows are detected. Only available when Redirect duplicate row is checked.
Fields to compare table	Lists the fields to compare—no entries means the step compares an entire row

Unique Rows

The Unique rows step removes duplicate rows from the input stream(s).

Important: Make sure that the input stream is sorted; otherwise, only consecutive double rows are evaluated correctly.

Options

The table below contains descriptions of all options for the Unique rows step:

Option	Description
Step name	Name of the step; this name has to be unique in a single transformation
Add counter to output?
Fields to compare table	Specify the field names on which you want to force uniqueness or click Get to insert all fields from the input stream(s) You can choose to ignore case by setting the Ignore case flag to Y. For example: Kettle, KETTLE, kettle are the same if the compare is performed as case-insensitive. In this instance, the first occurrence (Kettle) is passed to the next step(s).

Strings Cut

This step allows you to cut a portion of a substring. If the designated field is out of range, it returns blank.

Option	Definition
Step name	The name of this step as it appears in the transformation workspace.
In stream field	Name of the field whose substring to cut.
Out stream field	Rename the field upon output.
Cut from	Designate where to begin cutting the substring.
Cut to	Designate where to end cutting the substring.

Split Fields

Apply operations, such as trimming, padding, and others to the string value.

Option	Definition
Step name	The name of this step as it appears in the transformation workspace.
In stream field	Designate the field to transform.
Out stream field	Designate the name of the field to be created.
Trim type	Designate the trim type: none, left, right, or both.
Lower/Upper	Designate upper or lowercase.
Padding	Designate left or right padding.
Pad char	Designate the padding character.
Pad Length	Designate how long the padding will be.
InitCap	Transform to initial capitalization.
Escape
Digits	Designate whether to return remove, or do nothing to digits.
Remove Special character	Designate a special character to remove.

String Operations

Apply operations, such as trimming, padding, and others to the string value.

Option	Definition
Step name	The name of this step as it appears in the transformation workspace.
In stream field	Designate the field to transform.
Out stream field	Designate the name of the field to be created.
Trim type	Designate the trim type: none, left, right, or both.
Lower/Upper	Designate upper or lowercase.
Padding	Designate left or right padding.
Pad char	Designate the padding character.
Pad Length	Designate how long the padding will be.
InitCap	Transform to initial capitalization.
Escape
Digits	Designate whether to return remove, or do nothing to digits.
Remove Special character	Designate a special character to remove.

Split Fields

Description

The Field splitter step allows you to split fields based on delimiter information.

Options

The table below contains options configuring the Field splitter step:

Option	Description
Step name	Name of the step; this name has to be unique in a single transformation
Field to split	The name of the field you want to split
Delimiter	Delimiter that determines the end of a field
Fields table	This table is where you define the properties for each new field created by the split. For each new field, you must define the field name, data type, and other properties.

Split Field to Rows

Description

This step allows you to split a row containing a delimited field into multiple new rows, one per split value.
The original row is not delivered to following steps. The original field is left in each new row to assist in debugging or error handling. It can be removed via a Select Values step if needed.
The original field must be a string field.
A new field name must be specified.
If the value type/format of the new row is not specified, the new field will use the same type/format as the original field.

Options

Option	Description
Step name	Name of the step. Note: this name has to be unique in a single transformation
Field to split	The field you want to split
Delimiter	The delimiter or separator used
New name field

Sort Rows

The Sort rows step sorts rows based on the fields you specify and on whether they should be sorted in ascending or descending order.

Notes:

Kettle has to sort rows using temporary files when the number of rows exceeds the specified sort size (default 1 million rows). When you get an out of memory exception (OOME), you need to lower this size of change your available memory.

When you use multiple copies of the step in parallel (on the local JVM with "Change number of copies to start" or in a clustered environment using Carte) each of the sorted blocks need to be merged together to ensure the proper sort sequence. This can be done, be adding the Sorted Merge step afterwards (on the local JVM without multiple copies to start or in the cluster on the master).

Options

The following table describes the options associated with the Sort step:

Option	Description
Step name	Name of the step;this name has to be unique in a single transformation.
Sort directory	The directory in which the temporary files are stored in case when needed; the default is the standard temporary directory for the system
TMP-file prefix	Choose an easily recognized prefix so you can identify the files when they show up in the temp directory.
Sort size	The more rows you store in memory, the faster the sorting process because fewer temporary files must be used and less I/O is generated.
Free memory threshold (in %)	If the sort algorithm finds that it has less available free memory than the indicated number, it will start to page data to disk. Note: This is not exact science, because: This is checked every 1000 rows. Depending on the row size and other steps within complex transformations this could still lead to an OutOfMemoryError. In a Java Virtual Machine it's not possible to know the exact amount of free memory. As such we recommend you don't use this for very complex transformations with other steps and processes that use up a lot of memory.
Compress TMP Files	Compresses temporary files when they are needed to complete the sort.
Only pass unique rows?	Enable if you want to pass unique rows only to the output stream(s).
Fields table	Specify the fields and direction (ascending/descending) to sort. You can specify whether to perform a case sensitive sort (optional)
Get Fields	Click to retrieve a list of all fields coming in on the stream(s).

Set Field Value to a Constant

This step allows you to set the value of a field with a user-defined constant.

Option	Definition
Step name	The name of this step as it appears in the transformation workspace.
Use variable in constant	Selecting this option allows you to use variable replacement within a constant.
Field	Displays the fields with values that can be replaced.
Replace by value	The value that will replace existing values within the specified field.
Conversion mask (Date)	Allows you to specify a date format to convert to.

Set Field Value

Set the value of a field with the value of another field.

Option	Definition
Step name	The name of this step as it appears in the transformation workspace.
Field name	Displays the fields with values that can be replaced.
Replace by value from field	Specify the field value to replace the values in the Field name column.

Replace in String

Replace in string is a simple search and replace. It also supports regular expressions and group references. group references are picked up in the replace by string as $n where n is the number of the group.

For example
you could parse out sections of a string as follows
regex = ^([0-9]{4})([0-9]{7})$
replace with = $2
This will replace any 11 digit numbers with the last 7 digits.

Number Range

it will create ranges based on numeric fields.

Option	Definition
Step name	The name of this step as it appears in the transformation workspace.
Input field	Designate the field to create ranges for.
Output field	Designate the name of the field of ranges to be created.
Default value	Value to return if there are no matches within the ranges specified.
Ranges	Designated the upper and lower bound of a range.
Lower Bound	Designated the minimum value of a range.
Upper Bound	Designate the upper value of a range.
Value	Designated a name for the value.

Calculator

This step provides you with predefined functions that can be executed on input field values.

Besides the arguments (Field A, Field B and Field C) you must also specify the return type of the function. You can also choose to remove the field from the result (output) after all values are calculated.

The table below contains descriptions of the fields associated with the calculator step:

Function	Description	Required fields
Set field to constant A	Create a field with a constant value.	A
A + B	A plus B.	A and B
A - B	A minus B.	A and B
A * B	A multiplied by B.	A and B
A / B	A divided by B.	A and B
A * A	The square of A.	A
SQRT( A )	The square root of A.	A
100 * A / B	Percentage of A in B.	A and B
A - ( A * B / 100 )	Subtract B% of A.	A and B
A + ( A * B / 100 )	Add B% to A.	A and B
A + B *C	Add A and B times C.	A, B and C
SQRT( AA + BB )	Calculate ?(A2+B2).	A and B
ROUND( A )	Round A to the nearest integer.	A
ROUND( A, B )	Round A to B decimal positions.	A and B
NVL( A, B )	If A is not NULL, return A, else B. Note that sometimes your variable won't be null but an empty string.	A and B
Date A + B days	Add B days to Date field A.	A and B
Year of date A	Calculate the year of date A.	A
Month of date A	Calculate number the month of date A.	A
Day of year of date	A Calculate the day of year (1-365).	A
Day of month of date A	Calculate the day of month (1-31).	A
Day of week of date A	Calculate the day of week (1-7).	A
Week of year of date A	Calculate the week of year (1-54).	A
ISO8601 Week of year of date A	Calculate the week of the year ISO8601 style (1-53).	A
ISO8601 Year of date A	Calculate the year ISO8601 style.	A
Byte to hex encode of string A	Encode bytes in a string to a hexadecimal representation.	A
Hex encode of string A	Encode a string in its own hexadecimal representation.	A
Char to hex encode of string A	Encode characters in a string to a hexadecimal representation.	A
Hex decode of string A	Decode a string from its hexadecimal representation (add a leading 0 when A is of odd length).	A
Checksum of a file A using CRC-32	Calculate the checksum of a file using CRC-32.	A
Checksum of a file A using Adler-32	Calculate the checksum of a file using Adler-32.	A
Checksum of a file A using MD5	Calculate the checksum of a file using MD5.	A
Checksum of a file A using SHA-1	Calculate the checksum of a file using SHA-1.	A
Levenshtein Distance (Source A and Target B)	Calculates the Levenshtein Distance.	A and B
Metaphone of A (Phonetics)	Calculates the metaphone of A.	A
Double metaphone of A	Calculates the double metaphone of A.	A
Absolute value ABS(A)	Calculates the Absolute value of A.	A
Remove time from a date A	Removes time value of A.	A
Date A - Date B (in days)	Calculates difference, in days, between A date field and B date field.	A and B
A + B + C	A plus B plus C.	A, B, and C
First letter of each word of a string A in capital	Transforms the first letter of each word within a string.	A
UpperCase of a string A	Transforms a string to uppercase.	A
LowerCase of a string A	Transforms a string to lowercase.	A
Mask XML content from string A	Escape XML content; replace characters with &values.	A
Protect (CDATA) XML content from string A	Indicates an XML string is general character data, rather than non-character data or character data with a more specific, limited structure. The given string will be enclosed into <![CDATA[String]]>.	A
Remove CR from a string A	Removes carriage returns from a string.	A
Remove LF from a string A	Removes linefeeds from a string.	A
Remove CRLF from a string A	Removes carriage returns/linefeeds from a string.	A
Remove TAB from a string A	Removes tab characters from a string.	A
Return only digits from string A	Outputs only Outputs only digits (0-9) from a string from a string.	A
Remove digits from string A	Removes all digits (0-9) from a string.	A
Return the length of a string A	Returns the length of the string.	A
Load file content in binary	Loads the content of the given file (in field A) to a binary data type (e.g. pictures).	A
Add time B to date A	Add the time to a date, returns date and time as one value.	A and B
Quarter of date A	Returns the quarter (1 to 4) of the date.	A
variable substitution in string A	Substitute variables within a string.	A
Unescape XML content	Unescape XML content from the string.	A
Escape HTML content	Escape HTML within the string.	A
Unescape HTML content	Unescape HTML within the string.	A
Escape SQL content	Escapes the characters in a String to be suitable to pass to an SQL query.	A
Date A - Date B (working days)	Calculates the difference between Date field A and Date field B (only working days Mon-Fri).	A and B
Date A + B Months	Add B months to Date field A.	A
Check if an XML file A is well formed	Validates XML file input.	A
Check if an XML string A is well formed	Validates XML string input.	A
Get encoding of file A	Guess the best encoding (UTF-8) for the given file.	A
Dameraulevenshtein distance between String A and String B	Calculates Dameraulevenshtein distance between strings.	A and B
NeedlemanWunsch distance between String A and String B	Calculates NeedlemanWunsch distance between strings.	A and B
Jaro similitude between String A and String B	Returns the Jaro similarity coefficient between two strings.	A and B
JaroWinkler similitude between String A and String B	Returns the Jaro similarity coefficient between two strings.	A and B
SoundEx of String A	Encodes a string into a Soundex value.	A
RefinedSoundEx of String A	Retrieves the Refined Soundex code for a given string object	A
Date A + B Hours	Add B hours to Date field A	A and B
Date A + B Minutes	Add B minutes to Date field A	A and B
Date A - Date B (milliseconds)	Subtract B milliseconds from Date field A	A and B
Date A - Date B (seconds)	Subtract B seconds from Date field A	A and B
Date A - Date B (minutes)	Subtract B minutes from Date field A	A and B
Date A - Date B (hours)	Subtract B hours from Date field A	A and B

Pentaho master

Thursday, June 27, 2013

Value Mapper

Options

Unique Rows (HashSet)

Unique Rows

Options

Strings Cut

Split Fields

String Operations

Split Fields

Description

Options

Split Field to Rows

Description

Options

Sort Rows

Options

Set Field Value to a Constant

Set Field Value

Replace in String

Number Range

Calculator

Pages

About Me