Source: http://repository.seasr.org/Meandre/Repositories/Demo-Flows/FPGrowth/repository.rdf
URI: http://seasr.org/flows/fpgrowth/
Name: FPGrowth
Creator: admin
Date: 2008-07-25 (00:18:35)
Rights: UofI/NCSA
Tags: rule association, frequent pattern, discovery, visualization, pattern mining, pmml
Description: This flow loads a delimited data set into a table. The first row has attribute labels and the second row has attribute types. Input csv file from webdav, find frequent patterns with fpgrowth, convert rules to pmml to communicate with the rulevis applet.


Overview

Frequent pattern mining is an unsupervised learning approach that seeks to discover significant relationships among variables in a data set. Frequent pattern mining, or rule association, has also been called market basket analysis because of its application to the retail sales domain. Significant relationships are represented at two levels, structural and quantitative. At the structural level, the model indicates which variables are locally dependent on one another. At the quantitative level, the model offers numeric measures of support and confidence for these relationships.

  • Find all rules that correlate the presence of one set of items X with another item Y. For example, if a customer buys bread and butter, then they buy milk 85% of the time.
  • Support is the percentage of the records that contain both X and Y. A rule must have minimum user-specified support to show its impact.
  • Confidence is the percentage of the records containing X that also contain Y, that is, the number of records that contain both X and Y divided by the number of records that contain X. A rule must have some minimum user-specified confidence to show its value.
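
The two measures above can be worked through on a handful of records. The following is a minimal sketch, not the flow's own code: the function names and the basket data are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (X and Y) divided by support of X."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets, invented for this example.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"eggs"},
]

# Rule: {bread, butter} -> {milk}
rule_support = support(baskets, {"bread", "butter", "milk"})
rule_confidence = confidence(baskets, {"bread", "butter"}, {"milk"})
print(f"support={rule_support:.0%} confidence={rule_confidence:.0%}")
```

Here two of the five baskets contain all three items, and three baskets contain bread and butter, so the rule has 40% support and roughly 67% confidence.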

Application

This flow shows the resulting rules that indicate relationships among attributes about mushrooms.

Resulting Visualization

Other Applications

The number of applications for this method continues to grow and includes such things as:

  • Determining what text terms co-occur in a data collection.
  • Determining what services are purchased together.
  • Determining what products or transactions are executed by customers on a single visit to a website.
  • Determining previously unknown relationships in the data.

Detailed Description

This flow loads data, manipulates the data, and extracts rules. An itemset is a collection of items, and an item is an attribute-value pair that exists in the dataset. A data table can be used to build multiple rule tables with different combinations of attributes or with different values for support or items per itemset. A rule has two parts, the rule antecedent and the rule consequent.
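
To make the terminology concrete, here is a small sketch of how table rows might be turned into items and itemsets. It is an illustration only: the attribute names and values are hypothetical, loosely modeled on a mushroom table, and the representation (tuples in a frozenset) is an assumption rather than the flow's internal data structure.

```python
def row_to_items(labels, row):
    """Turn one table row into a set of (attribute, value) items."""
    return frozenset(zip(labels, row))


# Hypothetical attribute labels and rows, invented for illustration.
labels = ["cap-shape", "odor", "class"]
rows = [
    ["convex", "almond", "edible"],
    ["convex", "foul", "poisonous"],
    ["bell", "almond", "edible"],
]

transactions = [row_to_items(labels, r) for r in rows]

# A rule splits an itemset into an antecedent and a consequent.
antecedent = frozenset({("odor", "almond")})
consequent = frozenset({("class", "edible")})

# Rows that contain every item of the combined itemset.
matching = [t for t in transactions if (antecedent | consequent) <= t]
print(len(matching), "of", len(transactions), "rows contain the full itemset")
```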

This flow loads the data, bins the data (if necessary), and generates rules. The rules can be viewed in a graphical representation. The data is loaded by passing the URL of the data location, so this flow can be easily modified to load your own data. The data has two header rows at the top: a labels row and a types row. These parameters can be modified by adjusting the values in “Create Delimited File Parser”.
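
The file layout described above (a labels row, then a types row, then data) can be sketched with Python's standard csv module. This is not the “Create Delimited File Parser” component itself: the function name, the `labels_row` and `types_row` parameters, and the sample content (including the type name "String") are assumptions made for the example.

```python
import csv
import io


def parse_delimited(text, delimiter=",", labels_row=0, types_row=1):
    """Split delimited text into labels, types, and data rows."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    labels = rows[labels_row]
    types = rows[types_row]
    header_rows = {labels_row, types_row}
    # Everything that is not a header row (and not blank) is data.
    data = [r for i, r in enumerate(rows) if i not in header_rows and r]
    return labels, types, data


# Hypothetical file content, invented for illustration.
sample = (
    "odor,cap-shape,class\n"
    "String,String,String\n"
    "almond,convex,edible\n"
    "foul,convex,poisonous\n"
)

labels, types, data = parse_delimited(sample)
print(labels)
print(types)
print(len(data), "data rows")
```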

The “FPGrowth” component implements the FPGrowth algorithm to generate frequent itemsets consisting of items that occur in a sufficient number of examples to satisfy the minimum support criteria.

  • Minimum Support % is the percent of all examples that must contain a given set of items before an association rule will be formed containing those items. This value must be greater than 0 and less than or equal to 100.
  • Maximum Items per Rule is the maximum number of items to include in any rule. This setting does not impact performance for this algorithm as it does for the Apriori algorithm. This setting cannot be less than 2.
  • Generate Verbose Output should be set to TRUE if the module should report progress information to the console.
  • Generate Debug Output should be set to TRUE if the module should write verbose status information to the console.
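
The effect of Minimum Support % and Maximum Items per Rule can be illustrated with a brute-force enumeration. Note that this sketch is deliberately not FP-Growth: it simply counts every candidate combination, which is only practical for tiny inputs. The function name, parameter names, and data are assumptions for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support_pct, max_items):
    """Return {itemset: support %} for itemsets meeting the minimum support.

    Brute-force enumeration for illustration only; this is NOT FP-Growth.
    """
    if not 0 < min_support_pct <= 100:
        raise ValueError("minimum support must be > 0 and <= 100")
    if max_items < 2:
        raise ValueError("maximum items per rule cannot be less than 2")
    n = len(transactions)
    all_items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_items + 1):
        for candidate in combinations(all_items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            pct = 100.0 * count / n
            if pct >= min_support_pct:
                result[frozenset(candidate)] = pct
    return result


# Hypothetical examples, invented for illustration.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
found = frequent_itemsets(transactions, min_support_pct=50, max_items=2)
for itemset, pct in found.items():
    print(sorted(itemset), pct)
```

Raising the minimum support shrinks the result, and the maximum-items setting caps the size of the itemsets that are considered at all.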

The “Compute Confidence” component works in conjunction with other components implementing the Apriori or FP Growth rule association algorithms to generate association rules satisfying a minimum confidence threshold.

  • Minimum Confidence % is the percent of the examples containing a rule antecedent that must also contain the rule consequent before a potential association rule is accepted. This value must be greater than 0 and less than or equal to 100.
  • Report Module Progress should be set to TRUE if the component should report progress information to the console.
  • Generate Verbose Output should be set to TRUE if the component should write verbose status information to the console.
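
As a rough sketch of how a minimum confidence threshold filters candidate rules, the following takes one frequent itemset and tries each item as the consequent. It is not the “Compute Confidence” component's code: the function names and data are hypothetical, and restricting the consequent to a single item is an assumption made to keep the example short.

```python
def support_pct(transactions, items):
    """Percent of transactions that contain every item in `items`."""
    return 100.0 * sum(1 for t in transactions if items <= t) / len(transactions)


def rules_for_itemset(transactions, itemset, min_confidence_pct):
    """Rules with a single-item consequent drawn from one itemset."""
    if not 0 < min_confidence_pct <= 100:
        raise ValueError("minimum confidence must be > 0 and <= 100")
    itemset = frozenset(itemset)
    both = support_pct(transactions, itemset)
    rules = []
    for item in sorted(itemset):
        antecedent = itemset - {item}
        if not antecedent:
            continue
        antecedent_support = support_pct(transactions, antecedent)
        if antecedent_support == 0:
            continue
        conf = 100.0 * both / antecedent_support
        if conf >= min_confidence_pct:
            rules.append((antecedent, frozenset({item}), both, conf))
    return rules


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"eggs"},
]

rules = rules_for_itemset(baskets, {"bread", "butter", "milk"}, 80)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), sup, conf)
```

With an 80% threshold only the rule {butter, milk} -> {bread} survives; the other two splits of the same itemset reach only about 67% confidence.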

When executing the flow, the “Choose Attributes” web user interface will prompt the user to identify the input and output attributes. Use Shift to select a range of attributes. Use Control to select or deselect an individual attribute. Select (highlight) the attributes that should be used for input and the output attribute. The File menu also offers different sorting options. When selections are complete, click the Done button.

Note: For this application, we use input and output selections to choose the attribute-value pairs for the rule antecedent and the rule consequent values, respectively.

For this flow, the data is all categorical, so we do not need to bin the data. However, if we had numerical data, we would need to bin the data.
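
For readers unfamiliar with binning, the sketch below shows one common approach, equal-width binning, which replaces each number with a categorical label. This is an assumption for illustration: the binning components available to the flow may use a different method, and the function name and labels here are invented.

```python
def equal_width_bin(values, n_bins):
    """Map each number to a label 'bin_0' .. 'bin_{n_bins-1}' using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values are equal; everything lands in bin_0
    labels = []
    for v in values:
        # Clamp so that the maximum value falls in the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin_{index}")
    return labels


# Hypothetical numeric attribute values, invented for illustration.
sizes = [1.0, 2.5, 4.0, 9.5, 10.0]
binned = equal_width_bin(sizes, 3)
print(list(zip(sizes, binned)))
```

After binning, each (attribute, bin label) pair can be treated as an item in the same way as any categorical attribute-value pair.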

Once execution has completed, the console window contains information about the results of this analysis. The resulting visualization will open in a browser as an applet. This visualization presents a graphical representation of the result of the association rule algorithm. The main region of the display contains a matrix that visually depicts the rules. Each numbered column in the matrix corresponds to an association rule that met the minimum support and confidence requirements specified by the user in the rule discovery modules. Items used in the rules, that is attribute-value pairs, are listed along the left side of the matrix. Note that some items in the original data set may not be included in any rule because there was insufficient support and/or confidence to consider the item significant. An icon in the matrix cell indicates that an item is included in a rule. If the matrix cell icon is a box, then the item is part of the rule antecedent. If the icon is a check mark, then the item is part of the rule consequent.

Above the main matrix are two rows of bars labeled Confidence and Support. These bars align with the corresponding rule columns in the main matrix. For any given rule, the confidence and support values are represented by the degree to which the bars above the rule column are filled in. Brushing the mouse on a confidence or support bar displays the exact value that is graphically represented by the bar height.

The rules can be ordered by confidence or by support. To sort the rules, click either the support or confidence label; these labels are clickable radio buttons. If support is selected, the rules will be sorted using support as the primary key and confidence as the secondary key. Conversely, if the confidence button is chosen, confidence is the primary sort key and support is the secondary key.
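
The primary/secondary key ordering described above can be sketched as a two-part sort key. The rule records and the function name are hypothetical, and sorting from highest to lowest is an assumption, since the sort direction used by the applet is not stated here.

```python
def sort_rules(rules, primary):
    """Sort rules by `primary`, breaking ties with the other measure.

    Descending order is an assumption made for this sketch.
    """
    secondary = "confidence" if primary == "support" else "support"
    return sorted(rules, key=lambda r: (r[primary], r[secondary]), reverse=True)


# Hypothetical rules, invented for illustration.
rules = [
    {"id": 1, "support": 40.0, "confidence": 90.0},
    {"id": 2, "support": 40.0, "confidence": 95.0},
    {"id": 3, "support": 60.0, "confidence": 70.0},
]

by_support = [r["id"] for r in sort_rules(rules, "support")]
by_confidence = [r["id"] for r in sort_rules(rules, "confidence")]
print(by_support, by_confidence)
```

Rules 1 and 2 tie on support, so when support is the primary key their relative order is decided by confidence.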

Directly above the confidence and support display is a toolbar that provides additional functionality. On the left side of the toolbar are two buttons that allow the rows of the table to be displayed according to different sorting schemes. One of the buttons is active at all times. The Alphabetize button sorts the attribute-value combinations alphabetically. The Rank button sorts the rows based on the current Confidence/Support selection, moving the consequents and antecedents of the highest ranking rules to the top of the attribute-value list.

On the right side of the toolbar are four additional buttons. Restore Original reverts to the original table that was displayed before any sorting was done. Filter provides an interface that allows the user to display a subset of the generated rules; filtering is not part of this release. Print prints a screen capture of the visual display. The print output contains only the cells that are visible in the display window, not all the cells in the rule table. Printing is also accessible via the Options menu. Help displays information describing the visualization.

View of Flow

Data Type Restrictions

Every attribute-value combination is compared, so numerical attributes need to be binned.

Data Handling

There are several data transformations that occur in this processing.

Scalability

Every attribute-value combination is compared, so a large number of attribute-value pairs can take a long time to execute, and there is also a chance that you will get an “Out of Memory” error.
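
To get a feel for why the number of attribute-value pairs matters, the sketch below counts how many distinct item combinations exist up to a given size. It is only a rough upper bound under simplifying assumptions (any items may be combined freely); the number of itemsets the flow actually examines depends on the data and the minimum support. The function name is invented for the example.

```python
from math import comb


def combination_count(n_items, max_items):
    """Number of distinct itemsets of size 1..max_items drawn from n_items items."""
    return sum(comb(n_items, k) for k in range(1, max_items + 1))


for n in (10, 100, 1000):
    print(n, "items ->", combination_count(n, 3), "combinations of up to 3 items")
```

Going from 10 to 100 items raises the count from 175 to 166,750, which is why reducing the number of attributes or raising the minimum support can help with long run times and memory errors.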

Execution Criteria

None.

References

Han, J., J. Pei, and Y. Yin. “Mining frequent patterns without candidate generation.” ACM SIGMOD Record 29.2 (2000): 1-12.

