Binary Count Tree: An Efficient and Compact Structure for Mining Rare and Frequent Itemsets

 

Shwetha Rai,1 Geetha M.,1,* Preetham Kumar2 and Giridhar B.1

 

[1]Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India.

2Department of Information and Communication Technology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka 576104, India.

*E-mail: geetha.maiya@manipal.edu (M. Geetha)

 

Abstract

 

The discovery of rare and frequent itemsets is done efficiently if the datasets to be processed are stored within the main memory. In recent years, various data structures have been developed to represent a large dataset in a compact form, which otherwise cannot be stored as a whole within the main memory. Binary Count Tree (BIN-Tree), a tree data structure is proposed in this paper, represents the entire dataset in a compact and complete form without any information loss. Each transaction is encoded and stored as a node in the tree, in contrast to the existing algorithms that store each item as a node. The efficiency of BIN-Tree for datasets of varying size and dimensions was evaluated against Single Scan Pattern Tree (SSP-Tree) and Weighted Count Tree (WC-Tree). The results obtained revealed BIN-Tree to be 95% and 75% more space-efficient than SSP-Tree and WC-Tree, respectively. The BIN-Tree construction and discovery of itemsets from a large dataset were found to be 93% and 22% more time-efficient than SSP-Tree and WC-Tree, respectively. BIN-Tree is equally efficient to discover rare and frequent itemsets from a small dataset in the main memory.

 

Table of Contents

Diagram

Description automatically generated

Keywords: Association Rule Mining; BIN-Tree; Frequent itemsets; Rare itemsets; Tree data structure.

 



1. Introduction

Rare and frequent itemsets mining is a data mining task for discovering associations among itemsets in the database and forms an integral part of association rule mining (ARM) algorithm. Although inspired by supermarket-basket analysis, ARM has found applications in a variety of fields and has become a domain-independent technique.[1] It draws association rules among frequent and/or rare itemsets and aims to derive knowledge from different transactional databases.[2,3] The knowledge derived has contributed immensely to achieve excellence in respective fields of business, science, and technology. In later years, ARM was extended to find the association among the rare itemsets to uncover the previously ignored, but important relationships among the itemsets. An itemset is said to be frequent if it satisfies the user defined support threshold,[4] otherwise, it is considered as rare.[5] Most ARM algorithms are designed for small static datasets and require scanning the database multiple times to discover frequent and/or rare itemsets.[6,7]

In the current era, the volume of data generated has increased exponentially with the rigorous use of digital technologies. These data are acquired from various sources in the form of invoices, medical reports, social network platforms, and scientific observations among others.[8] Majority of educational, healthcare and industry sectors have been forced to carry out their activities online due to pandemic and this shift from conventional mode has led to an enormous amount of data being generated.[9] Interesting information can be obtained by analyzing the patterns in these data. However, the volume and variety of data generated cannot be manually processed and multi-scan-based ARM algorithms require enormous amount of time to derive the associations among various itemsets. To accelerate the computations, single scan-based technique can be employed to discover rare and frequent itemsets from the dataset.

Data structures play a key role in storage and discovery of interesting and meaningful information from the input dataset. Useful information can be discovered efficiently if the entire dataset is stored in the main memory and number database scans are reduced. Several data structures and algorithms were presented to discover rare and frequent itemsets in a single scan. Perfect hashing and pruning algorithm (PHP), which is similar to direct hashing and pruning algorithm, stores the data in hash table without collision. PHP utilized recursive hashing, a method similar to lexicographical ordered tree to discover frequent itemsets.[10,11] Eclat algorithm represented the database in vertical format and added the transaction identifier to the item $i$ if and only if $i$ was present in that transaction. Several variations of this algorithm such as Tidset and Diffset have been implemented to discover the frequent itemsets.[12] A Pattern-tree was constructed by inserting the alphabetically sorted transaction items in a single scan. After the entire database was read, the list containing the items was sorted and based on the updated list, each path in the tree was restructured.[13]

Ananthanarayana et al. implemented Pattern Count tree algorithm to discover frequent itemsets from the general and complete tree in a single database scan. From this tree, a Lexicographically Ordered FP-Tree was constructed to discover frequent itemsets.[14] Geetha M. et al. designed and implemented weighted count (WC) tree using prime number concept to store all transaction items in the main memory to discover frequent concepts. Each transaction item was encoded using a unique prime number and its product was stored in the node of tree. The memory space used to store the dataset in the main memory is reduced significantly as each node represents a transaction in contrast with other algorithms where each node represents an item. But there was information loss when the number of items per transaction was more.[15] Compact graph was implemented using a graph data structure to store the database efficiently in the main memory. Each node in the graph represented an item and the edge between the nodes represented the transactions using data encoding. It was reported to be memory efficient than Frequent Pattern Tree.[7] However, the tree construction and mining were inefficient due to the need for encoding/decoding of data in the graph.[16] Single Scan Frequent Itemset Mining algorithm implemented by Youcef Djenouri et al. discovered frequent itemsets in a single scan. The transactions were read from the database sequentially and the subsets of all the items in the transaction generated were stored in the hash table along with its corresponding support count.[17] Full Compression Frequent Pattern tree used a compressed tree data structure to store all itemsets from the database. Two head tables were created for storing frequent and infrequent itemsets, respectively in the sorted order with respect to their support count. The itemsets in the tree were restructured based on its position in the head table. All single paths in FCFP tree were compressed to a single node by storing all the itemsets in that path in the string format.[18] An Improved Apriori algorithm scans the database and constructs a 2-dimensional matrix, called 0-1 matrix. Each transaction was represented as a row and each column represented the items in the transaction. The presence of item in the transaction database was represented as 1 and its absence as 0.[19] Postdiffset algorithm discovered frequent itemset similar to Diffset algorithm for obtaining the result diffset.[20,21] R-Eclat algorithm discovered rare itemsets from the database using vertical representation of the transaction database. If the support(itemset) ≤ min_freq then in IF-Tidset, (i ∩ (i+1)) was computed for ith and (i+1)th columns, whereas in IF-Diffset, diffset (i ∩ (i+1)) was computed for ith and (i+1)th columns, and in IF-Sortdiffset, diffset (i ∩ (i+1)) was computed for ith and (i+1)th columns after sorting the itemsets in descending order depending on the largest to smallest value in equivalence class of the itemset.[22] Single scan pattern-tree (SSP-Tree) was constructed using a single scan of the database. The tree was restructured based on the frequency of the itemsets stored in header table before insertion of the next transaction into the tree to maintain the order of the items in the tree. SSP-Tree algorithm was also used to store the data and find outliers in the medical records using incremental mining.[23,24]

Several existing ARM algorithms discover either frequent or rare itemsets and the association among them. However, a very few ARM algorithms discover both rare and frequent itemsets while storing the entire database in the main memory without losing the actual information. Some of the algorithms require restructuring of the tree or re-scanning of the database due to change in the support threshold. Hence, a novel binary encoded data structure, Binary count tree, to store the entire database in the main memory is proposed in this paper. Binary Count Tree is a complete and compact representation of the entire database to discover rare and frequent itemsets using a single database scan without loss of any information. It does not require re-scanning of the database when there is a change in the support threshold. The steps for binary count tree construction and discovery of rare and frequent itemsets from the tree are presented in section 2. The experimental results and its corresponding analysis are discussed in section 3. Section 4 concludes the paper and provides some input on the future research directions.

 

2. Methodology

Binary count tree (BIN-Tree), a compact and complete data structure, is proposed in this work for storing the entire database in main memory in a single database. The data structure enables the discovery of both rare and frequent itemsets efficiently from the tree without loss of information even when the support threshold is changed.

Fig. 1 Structure of a node with Weight, Count, Child, and Next pointer in BIN-Tree.

 

2.1 Structure of a node in Binary count tree

The structure of a node in BIN-Tree contains four fields as shown in Fig. 1. The Weight field in the node holds the information of a transaction in the database, count field holds the support count of the transaction, Child pointer holds the address of child node and next pointer holds the address of sibling node. Hence, each node in the tree represents a transaction in the data set.

 

2.2 Construction of Binary Count Tree

BIN-Tree is constructed in a single database scan where the transactions are read sequentially and encoded in the binary format. The steps to construct Binary count tree are as follows:

  1. Read items in the transaction one at a time and indicate its presence or absence in the bitset using 1 and 0 respectively. The bitset forms the Weight of the transaction.

For example, if items are numbered as 1, 3, 4 and 6 for a particular transaction then the corresponding bitset representation would be 1011010.

  1. If the tree is empty then, create a node with the associated Weight, initialise count to 1 and attach it to the child of the root.
  2. If tree is not empty, start traversing from the child of the root and traverse the tree based on the following cases until all transactions are inserted.

(a)    If the traversed nodeís Weight is equal to the transaction Weight, increase the count of the traversed node by a value of 1.

(b)    If the traversed node's Weight is a proper subset of the Weight of the transaction, then move to the child of the traversed node, if it is not NULL. Otherwise, if the child of the traversed node is NULL, remove all common factors between parent and transaction, create a node similar to step 2 and add the created node as the child of the traversed node.

(c)    If the transactionís Weight is a proper subset of the Weight of the traversed node, then create a node similar to step 2 and add this as the parent of the traversed node. Move all the siblings of the traversed node whose values are not proper subset of Weight of the transaction as the siblings of the created node. Lastly, remove the factors common to the parent and child from the child node.

(d)    If transaction Weight and Weight of the traversed node are proper subset of each other, move to the right of the traversed node, if it is not NULL. If it is NULL, then create a node similar to step 2 and add it to the right of the traversed node.

 

2.3 Mining itemsets from Binary Count Tree

The rare and frequent itemsets from the BIN-Tree are discovered using following steps:

1.       Create a Hash Table with a prime number based on dataset and create a secondary Hash Table for each entry in the primary hash table using a different prime number.

2.       For each itemset in a node in the tree, generate the subsets of items marked as present. If the node is a child node, then perform a join operation with subset generated from the Weight of parent node.

3.       Add the elements of each subset, map this sum using the above two primes and update it in the Hash Table.

4.       If the value is already present in the hash table, then add the count in the hash table with the count of the traversed node.

5.       Otherwise, insert the value to the hash table initializing its count to the count of the traversed node.

6.       Once all the nodes are traversed, iterate through the hash table, and display all the table entries whose count are between the two threshold values set by the user.

 

2.4 Illustration of the Binary Count Tree

BIN-Tree constructed for the sample database in Table 1 is shown in Fig. 2. The first transaction with items {1, 2} is read and a vector with sixty-four zeroes followed by 110 is created. The vector of zeroes and ones forms the Weight of the transaction. The first transaction is inserted as the first child of root node and its support is initialized to 1. The Weight of the second transaction is calculated and inserted as sibling node of first transaction because the superset () relation (refer definition 1.9 in supplementary material) does not apply on transaction 1 and transaction 2. Transaction 3 is also inserted similar to transaction 2. Transaction 4 is inserted as the child of transaction 1 because of the relation Weight(T4) Weight(T1). The common factors {1, 2} are removed from the Weight of transaction 4 and the support of node containing transaction 1 is incremented by 1 before inserting transaction 4. Remaining nodes are inserted into the tree based on relation between the Weight of transactions.

Table 1. Sample database with transactions' Weight.

TID

Transaction items

Binary Representation (Transaction Weight)

1

1,2

00...0110

2

3,4

00...011000

3

1,3

00...01010

4

1,2,3

00...01110

5

4

00...010000

6

1,2,3,65,66,67

1110...01110

7

1,2,3,57,58,59,60,61

00...0111110...01110

8

1,2,3,62,63,64

00...01110...01110

The rare and frequent itemsets generated for the sample database given in Table 1 with support thresholds minfreq= 50% and minrare= 20% are given as a combination of {itemset: Support_count}.

  1. Frequent itemsets: