Thursday, June 27, 2013

Value Mapper

The Value Mapper maps string values from one value to another. Mapping is usually solved by storing the conversion table in a database. The Value Mapper provides you with a simple alternative.

Note: You can convert a null field or empty String value to a non-empty value. Leave the Source Value field empty. You can specify one of these empty source field values.
Note 2: Null or empty value won't be mapped to the "Default upon non-matching" value. They stay null. Use the above tip if you want to map them anyway.

Options

The following properties are used to define the mappings:
Option Description
Step name Name of the step; this name has to be unique in a single transformation.
Fieldname to use Field to use as the mapping source
Target field name Field to use as the mapping Target
Default upon non-matching Defines a default value for situations where the source value is not empty, but there is no match
Field values table Contains the mapping of source value to converted target value

Unique Rows (HashSet)

The Unique Rows (HashSet) transformation step tracks exact duplicate rows. The step can also remove duplicate rows and leave only unique occurrences. Unlike the Unique Rows transformation step, which only correctly evaluates consecutive duplicate rows unless used with a sorted input, the Unique Rows (HashSet) step does not require a sorted input to process duplicate rows, instead it tracks duplicates in memory.
Option
Definition
Step name
Name of this step as it appears in the transformation workspace
Compare using stored row values
Stores values for the selected fields in memory for every record. Storing row values requires more memory, but it prevents possible false positives if there are hash collisions.
Redirect duplicate row
Processes duplicate rows as an error and redirect rows to the error stream of the step. Requires you to set error handling for this step.
Error description
Sets the error handling description to display when duplicate rows are detected. Only available when Redirect duplicate row is checked.
Fields to compare table
Lists the fields to compare—no entries means the step compares an entire row

Unique Rows

The Unique rows step removes duplicate rows from the input stream(s).
Important: Make sure that the input stream is sorted; otherwise, only consecutive double rows are evaluated correctly.

Options

The table below contains descriptions of all options for the Unique rows step:
Option Description
Step name Name of the step; this name has to be unique in a single transformation
Add counter to output?  
Fields to compare table Specify the field names on which you want to force uniqueness or click Get to insert all fields from the input stream(s) You can choose to ignore case by setting the Ignore case flag to Y. For example: Kettle, KETTLE, kettle are the same if the compare is performed as case-insensitive. In this instance, the first occurrence (Kettle) is passed to the next step(s).

Strings Cut

This step allows you to cut a portion of a substring. If the designated field is out of range, it returns blank.
Option
Definition
Step name
The name of this step as it appears in the transformation workspace.
In stream field
Name of the field whose substring to cut.
Out stream field
Rename the field upon output.
Cut from
Designate where to begin cutting the substring.
Cut to
Designate where to end cutting the substring.

Split Fields

Apply operations, such as trimming, padding, and others to the string value.
Option
Definition
Step name
The name of this step as it appears in the transformation workspace.
In stream field
Designate the field to transform.
Out stream field
Designate the name of the field to be created.
Trim type
Designate the trim type: none, left, right, or both.
Lower/Upper
Designate upper or lowercase.
Padding
Designate left or right padding.
Pad char
Designate the padding character.
Pad Length
Designate how long the padding will be.
InitCap
Transform to initial capitalization.
Escape

Digits
Designate whether to return remove, or do nothing to digits.
Remove Special character
Designate a special character to remove.

String Operations

Apply operations, such as trimming, padding, and others to the string value.
Option
Definition
Step name
The name of this step as it appears in the transformation workspace.
In stream field
Designate the field to transform.
Out stream field
Designate the name of the field to be created.
Trim type
Designate the trim type: none, left, right, or both.
Lower/Upper
Designate upper or lowercase.
Padding
Designate left or right padding.
Pad char
Designate the padding character.
Pad Length
Designate how long the padding will be.
InitCap
Transform to initial capitalization.
Escape

Digits
Designate whether to return remove, or do nothing to digits.
Remove Special character
Designate a special character to remove.

Split Fields

Description

The Field splitter step allows you to split fields based on delimiter information.

Options

The table below contains options configuring the Field splitter step:
Option Description
Step name Name of the step; this name has to be unique in a single transformation
Field to split The name of the field you want to split
Delimiter Delimiter that determines the end of a field
Fields table This table is where you define the properties for each new field created by the split. For each new field, you must define the field name, data type, and other properties.

Split Field to Rows

Description

This step allows you to split a row containing a delimited field into multiple new rows, one per split value.
The original row is not delivered to following steps. The original field is left in each new row to assist in debugging or error handling.  It can be removed via a Select Values step if needed.
The original field must be a string field.
A new field name must be specified.
If the value type/format of the new row is not specified, the new field will use the same type/format as the original field.

Options

Option Description
Step name Name of the step. Note: this name has to be unique in a single transformation
Field to splitThe field you want to split
DelimiterThe delimiter or separator used
New name field

Sort Rows

The Sort rows step sorts rows based on the fields you specify and on whether they should be sorted in ascending or descending order.
Notes:
  • Kettle has to sort rows using temporary files when the number of rows exceeds the specified sort size (default 1 million rows). When you get an out of memory exception (OOME), you need to lower this size of change your available memory.
  • When you use multiple copies of the step in parallel (on the local JVM with "Change number of copies to start" or in a clustered environment using Carte) each of the sorted blocks need to be merged together to ensure the proper sort sequence. This can be done, be adding the Sorted Merge step afterwards (on the local JVM without multiple copies to start or in the cluster on the master).

Options

The following table describes the options associated with the Sort step:
Option Description
Step name Name of the step;this name has to be unique in a single transformation.
Sort directory The directory in which the temporary files are stored in case when needed; the default is the standard temporary directory for the system
TMP-file prefix Choose an easily recognized prefix so you can identify the files when they show up in the temp directory.
Sort size The more rows you store in memory, the faster the sorting process because fewer temporary files must be used and less I/O is generated.
Free memory threshold (in %) If the sort algorithm finds that it has less available free memory than the indicated number, it will start to page data to disk.
Note: This is not exact science, because:
  1. This is checked every 1000 rows. Depending on the row size and other steps within complex transformations this could still lead to an OutOfMemoryError.
  2. In a Java Virtual Machine it's not possible to know the exact amount of free memory.  As such we recommend you don't use this for very complex transformations with other steps and processes that use up a lot of memory.
Compress TMP Files Compresses temporary files when they are needed to complete the sort.
Only pass unique rows? Enable if you want to pass unique rows only to the output stream(s).
Fields table Specify the fields and direction (ascending/descending) to sort. You can specify whether to perform a case sensitive sort (optional)
Get Fields Click to retrieve a list of all fields coming in on the stream(s).

Set Field Value to a Constant

This step allows you to set the value of a field with a user-defined constant.
Option
Definition
Step name
The name of this step as it appears in the transformation workspace.
Use variable in constant
Selecting this option allows you to use variable replacement within a constant.
Field
Displays the fields with values that can be replaced.
Replace by value
The value that will replace existing values within the specified field.
Conversion mask (Date)
Allows you to specify a date format to convert to.

Set Field Value

Set the value of a field with the value of another field.
Option
Definition
Step name
The name of this step as it appears in the transformation workspace.
Field name
Displays the fields with values that can be replaced.
Replace by value from field
Specify the field value to replace the values in the Field name column.


Replace in String

Replace in string is a simple search and replace. It also supports regular expressions and group references. group references are picked up in the replace by string as $n where n is the number of the group.

For example
you could parse out sections of a string as follows
regex = ^([0-9]{4})([0-9]{7})$
replace with = $2
This will replace any 11 digit numbers with the last 7 digits.

Number Range

it will create ranges based on numeric fields.
Option
Definition
Step name
The name of this step as it appears in the transformation workspace.
Input field
Designate the field to create ranges for.
Output field
Designate the name of the field of ranges to be created.
Default value
Value to return if there are no matches within the ranges specified.
Ranges
Designated the upper and lower bound of a range.
Lower Bound
Designated the minimum value of a range.
Upper Bound
Designate the upper value of a range.
Value
Designated a name for the value.

Calculator

This step provides you with predefined functions that can be executed on input field values.
Besides the arguments (Field A, Field B and Field C) you must also specify the return type of the function. You can also choose to remove the field from the result (output) after all values are calculated.
The table below contains descriptions of the fields associated with the calculator step:
Function
Description
Required fields
Set field to constant A
Create a field with a constant value.
A
A + B
A plus B.
A and B
A - B
A minus B.
A and B
A * B
A multiplied by B.
A and B
A / B
A divided by B.
A and B
A * A
The square of A.
A
SQRT( A )
The square root of A.
A
100 * A / B
Percentage of A in B.
A and B
A - ( A * B / 100 )
Subtract B% of A.
A and B
A + ( A * B / 100 )
Add B% to A.
A and B
A + B *C
Add A and B times C.
A, B and C
SQRT( A*A + B*B )
Calculate ?(A2+B2).
A and B
ROUND( A )
Round A to the nearest integer.
A
ROUND( A, B )
Round A to B decimal positions.
A and B
NVL( A, B )
If A is not NULL, return A, else B. Note that sometimes your variable won't be null but an empty string.
A and B
Date A + B days
Add B days to Date field A.
A and B
Year of date A
Calculate the year of date A.
A
Month of date A
Calculate number the month of date A.
A
Day of year of date
A Calculate the day of year (1-365).
A
Day of month of date A
Calculate the day of month (1-31).
A
Day of week of date A
Calculate the day of week (1-7).
A
Week of year of date A
Calculate the week of year (1-54).
A
ISO8601 Week of year of date A
Calculate the week of the year ISO8601 style (1-53).
A
ISO8601 Year of date A
Calculate the year ISO8601 style.
A
Byte to hex encode of string A
Encode bytes in a string to a hexadecimal representation.
A
Hex encode of string A
Encode a string in its own hexadecimal representation.
A
Char to hex encode of string A
Encode characters in a string to a hexadecimal representation.
A
Hex decode of string A
Decode a string from its hexadecimal representation (add a leading 0 when A is of odd length).
A
Checksum of a file A using CRC-32
Calculate the checksum of a file using CRC-32.
A
Checksum of a file A using Adler-32
Calculate the checksum of a file using Adler-32.
A
Checksum of a file A using MD5
Calculate the checksum of a file using MD5.
A
Checksum of a file A using SHA-1
Calculate the checksum of a file using SHA-1.
A
Levenshtein Distance (Source A and Target B)
Calculates the Levenshtein Distance.
A and B
Metaphone of A (Phonetics)
Calculates the metaphone of A.
A
Double metaphone of A
Calculates the double metaphone of A.
A
Absolute value ABS(A)
Calculates the Absolute value of A.
A
Remove time from a date A
Removes time value of A.
A
Date A - Date B (in days)
Calculates difference, in days, between A date field and B date field.
A and B
A + B + C
A plus B plus C.
A, B, and C
First letter of each word of a string A in capital
Transforms the first letter of each word within a string.
A
UpperCase of a string A
Transforms a string to uppercase.
A
LowerCase of a string A
Transforms a string to lowercase.
A
Mask XML content from string A
Escape XML content; replace characters with &values.
A
Protect (CDATA) XML content from string A
Indicates an XML string is general character data, rather than non-character data or character data with a more specific, limited structure. The given string will be enclosed into <![CDATA[String]]>.
A
Remove CR from a string A
Removes carriage returns from a string.
A
Remove LF from a string A
Removes linefeeds from a string.
A
Remove CRLF from a string A
Removes carriage returns/linefeeds from a string.
A
Remove TAB from a string A
Removes tab characters from a string.
A
Return only digits from string A
Outputs only Outputs only digits (0-9) from a string from a string.
A
Remove digits from string A
Removes all digits (0-9) from a string.
A
Return the length of a string A
Returns the length of the string.
A
Load file content in binary
Loads the content of the given file (in field A) to a binary data type (e.g. pictures).
A
Add time B to date A
Add the time to a date, returns date and time as one value.
A and B
Quarter of date A
Returns the quarter (1 to 4) of the date.
A
variable substitution in string A
Substitute variables within a string.
A
Unescape XML content
Unescape XML content from the string.
A
Escape HTML content
Escape HTML within the string.
A
Unescape HTML content
Unescape HTML within the string.
A
Escape SQL content
Escapes the characters in a String to be suitable to pass to an SQL query.
A
Date A - Date B (working days)
Calculates the difference between Date field A and Date field B (only working days Mon-Fri).
A and B
Date A + B Months
Add B months to Date field A.
A
Check if an XML file A is well formed
Validates XML file input.
A
Check if an XML string A is well formed
Validates XML string input.
A
Get encoding of file A
Guess the best encoding (UTF-8) for the given file.
A
Dameraulevenshtein distance between String A and String B
Calculates Dameraulevenshtein distance between strings.
A and B
NeedlemanWunsch distance between String A and String B
Calculates NeedlemanWunsch distance between strings.
A and B
Jaro similitude between String A and String B
Returns the Jaro similarity coefficient between two strings.
A and B
JaroWinkler similitude between String A and String B
Returns the Jaro similarity coefficient between two strings.
A and B
SoundEx of String A
Encodes a string into a Soundex value.
A
RefinedSoundEx of String A
Retrieves the Refined Soundex code for a given string object
A
Date A + B Hours
Add B hours to Date field A
A and B
Date A + B Minutes
Add B minutes to Date field A
A and B
Date A - Date B (milliseconds)
Subtract B milliseconds from Date field A
A and B
Date A - Date B (seconds)
Subtract B seconds from Date field A
A and B
Date A - Date B (minutes)
Subtract B minutes from Date field A
A and B
Date A - Date B (hours)
Subtract B hours from Date field A
A and B