Association Rules Mining Using Python Generators to Handle Large Datasets

Motivation

I was looking to run association analysis in Python using the apriori algorithm to derive rules of the form {A} -> {B}. However, I quickly discovered that it's not part of the standard Python machine learning libraries. Although some implementations exist, I could not find one capable of handling large datasets. "Large" in my case was an orders dataset with 32 million records, containing 3.2 million unique orders and about 50K unique items (file size just over 1 GB). So I decided to write my own implementation, leveraging the apriori algorithm to generate simple {A} -> {B} association rules. Since I only care about relationships between pairs of items, running apriori up to item sets of size 2 is sufficient. I went through various iterations, splitting the data into multiple subsets just so I could get functions like crosstab and combinations to run on my machine with 8 GB of memory. :) But even with this approach, I could only process about 1800 items before my kernel would crash... And that's when I learned about the wonderful world of Python generators.

Python Generators

In a nutshell, a generator is a special type of function that returns an iterator over a sequence of items. However, unlike a regular function, which returns all of its values at once (eg: returning all the elements of a list), a generator yields one value at a time. To get the next value, we must ask for it - either explicitly, by calling the built-in next() function on the generator, or implicitly via a for loop. This is a great property of generators because it means that we don't have to store all of the values in memory at once. We can load and process one value at a time, discard it when finished and move on to the next value. This feature makes generators perfect for creating item pairs and counting their frequency of co-occurrence. Here's a concrete example of what we're trying to accomplish:

  1. Get all possible item pairs for a given order

    eg:  order 1:  apple, egg, milk    -->  item pairs: {apple, egg}, {apple, milk}, {egg, milk}
         order 2:  egg, milk           -->  item pairs: {egg, milk}
  2. Count the number of times each item pair appears

    eg: {apple, egg}: 1
        {apple, milk}: 1
        {egg, milk}: 2

Here's the generator that implements the above tasks:

In [1]:
import numpy as np
from itertools import combinations, groupby
from collections import Counter

# Sample data
orders = np.array([[1,'apple'], [1,'egg'], [1,'milk'], [2,'egg'], [2,'milk']], dtype=object)

# Generator that yields item pairs, one at a time
def get_item_pairs(order_item):
    
    # For each order, generate a list of items in that order
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]      
    
        # For each item list, generate item pairs, one at a time
        for item_pair in combinations(item_list, 2):
            yield item_pair                                      


# Counter iterates through the item pairs returned by our generator and keeps a tally of their occurrence
Counter(get_item_pairs(orders))
Out[1]:
Counter({('apple', 'egg'): 1, ('apple', 'milk'): 1, ('egg', 'milk'): 2})

get_item_pairs() generates a list of items for each order and produces item pairs for that order, one pair at a time. The first item pair is passed to Counter which keeps track of the number of times an item pair occurs. The next item pair is taken, and again, passed to Counter. This process continues until there are no more item pairs left. With this approach, we end up not using much memory as item pairs are discarded after the count is updated.
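
To make the one-at-a-time behaviour concrete, here is a minimal sketch that steps through a generator by hand. The helper name pairs_for_order is hypothetical and is used only for this illustration:

```python
from itertools import combinations

# Hypothetical helper (illustration only): yields the item pairs of one order
def pairs_for_order(items):
    for pair in combinations(items, 2):
        yield pair

gen = pairs_for_order(['apple', 'egg', 'milk'])

# Nothing has been computed yet; each call to next() produces exactly one pair
first = next(gen)        # ('apple', 'egg')

# A for loop (or list()) keeps asking for values until the generator is exhausted
rest = list(gen)         # [('apple', 'milk'), ('egg', 'milk')]

# Once exhausted, the generator has nothing left to give
leftover = next(gen, None)   # None

print(first, rest, leftover)
```

Counter consumes get_item_pairs() in the same way as the for loop here: it requests one pair, updates its tally, and only then requests the next.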

Apriori Algorithm

Apriori is an algorithm used to identify frequent item sets (in our case, item pairs). It does so using a "bottom up" approach, first identifying individual items that satisfy a minimum occurrence threshold. It then extends the item set, adding one item at a time and checking if the resulting item set still satisfies the specified threshold. The algorithm stops when there are no more items to add that meet the minimum occurrence requirement. Here's an example of apriori in action, assuming a minimum occurrence threshold of 3:

order 1: apple, egg, milk  
order 2: carrot, milk  
order 3: apple, egg, carrot
order 4: apple, egg
order 5: apple, carrot


Iteration 1:  Count the number of times each item occurs   
item set      occurrence count    
{apple}              4   
{egg}                3   
{carrot}             3   
{milk}               2   

{milk} is eliminated because it does not meet the minimum occurrence threshold.


Iteration 2: Build item sets of size 2 using the remaining items from Iteration 1 (ie: apple, egg, carrot)  
item set           occurrence count  
{apple, egg}             3  
{apple, carrot}          2  
{egg, carrot}            1  

{apple, carrot} and {egg, carrot} are eliminated because they do not meet the minimum occurrence threshold. Only {apple, egg} remains and the algorithm stops, since no item set of size 3 can be built from a single frequent pair.

If we had more orders and items, we could continue to iterate, building item sets consisting of more than 2 elements. For the problem we are trying to solve (ie: finding relationships between pairs of items), it suffices to implement apriori to get to item sets of size 2.
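
The two iterations can be sketched in a few lines on the five example orders. This is a minimal illustration of the idea, not the implementation used later in this post; the variable names are my own and the threshold is the raw count of 3 used in the example:

```python
from collections import Counter
from itertools import combinations

orders = [
    ['apple', 'egg', 'milk'],
    ['carrot', 'milk'],
    ['apple', 'egg', 'carrot'],
    ['apple', 'egg'],
    ['apple', 'carrot'],
]
min_count = 3  # minimum occurrence threshold from the example

# Iteration 1: count single items and keep those that meet the threshold
item_counts = Counter(item for order in orders for item in set(order))
frequent_items = {item for item, n in item_counts.items() if n >= min_count}

# Iteration 2: build pairs only from items that survived iteration 1
pair_counts = Counter(
    pair
    for order in orders
    for pair in combinations(sorted(set(order) & frequent_items), 2)
)
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}

print(frequent_pairs)  # {('apple', 'egg'): 3}
```

Restricting iteration 2 to the surviving items is what keeps the number of candidate pairs manageable: a pair cannot be frequent unless both of its items are.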

Association Rules Mining

Once the item sets have been generated using apriori, we can start mining association rules. Given that we are only looking at item sets of size 2, the association rules we will generate will be of the form {A} -> {B}. One common application of these rules is in the domain of recommender systems, where customers who purchased item A are recommended item B.

Here are 3 key metrics to consider when evaluating association rules:

  1. support
    This is the percentage of orders that contain the item set. In the example above, there are 5 orders in total and {apple,egg} occurs in 3 of them, so:

                 support{apple,egg} = 3/5 or 60%

    The minimum support threshold required by apriori can be set based on knowledge of your domain. In this grocery dataset for example, since there could be thousands of distinct items and an order can contain only a small fraction of these items, setting the support threshold to 0.01% may be reasonable.

  2. confidence
    Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that item A was purchased. This is expressed as:

                 confidence{A->B} = support{A,B} / support{A}

    Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 indicates that B is always purchased whenever A is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that item A is purchased, given that item B was purchased:

                 confidence{B->A} = support{A,B} / support{B}

    In our example, the percentage of times that egg is purchased, given that apple was purchased is:

                 confidence{apple->egg} = support{apple,egg} / support{apple}
                                        = (3/5) / (4/5)
                                        = 0.75 or 75%

    A confidence value of 0.75 implies that out of all orders that contain apple, 75% of them also contain egg. Now, we look at the confidence measure in the opposite direction (ie: egg->apple):

                 confidence{egg->apple} = support{apple,egg} / support{egg}
                                        = (3/5) / (3/5)
                                        = 1 or 100%  

    Here we see that all of the orders that contain egg also contain apple. But, does this mean that there is a relationship between these two items, or are they occurring together in the same orders simply by chance? To answer this question, we look at another measure which takes into account the popularity of both items.

  3. lift
    Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items are occurring together in the same orders simply by chance (ie: at random). Unlike the confidence metric, whose value may vary depending on direction (eg: confidence{A->B} may be different from confidence{B->A}), lift has no direction. This means that lift{A,B} is always equal to lift{B,A}:

                 lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})

    In our example, we compute lift as follows:

                 lift{apple,egg} = lift{egg,apple} = support{apple,egg} / (support{apple} * support{egg})
                                                   = (3/5) / (4/5 * 3/5) 
                                                   = 1.25

    One way to understand lift is to think of the denominator as the likelihood that A and B would appear in the same order if there were no relationship between them. In the example above, apple occurs in 80% of the orders and egg occurs in 60% of the orders, so if there were no relationship between them, we would expect both of them to show up together in the same order 48% of the time (ie: 80% * 60%). The numerator, on the other hand, represents how often apple and egg actually appear together in the same order. In this example, that is 60% of the time. Dividing the numerator by the denominator tells us how many times more often apple and egg actually appear in the same order, compared to what we would see if there were no relationship between them (ie: if they were occurring together simply at random).

    In summary, lift can take on the following values:

     * lift = 1 implies no relationship between A and B. 
       (ie: A and B occur together only by chance)
    
     * lift > 1 implies that there is a positive relationship between A and B.
       (ie:  A and B occur together more often than random)
    
     * lift < 1 implies that there is a negative relationship between A and B.
       (ie:  A and B occur together less often than random)

    In our example, apple and egg occur together 1.25 times as often as we would expect by chance, so we conclude that there exists a positive relationship between them.
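
The three metrics can be checked numerically on the five example orders. This is a small sketch with names of my own choosing (count_orders_with is a hypothetical helper); it works from raw counts so that the results are exact:

```python
orders = [
    {'apple', 'egg', 'milk'},
    {'carrot', 'milk'},
    {'apple', 'egg', 'carrot'},
    {'apple', 'egg'},
    {'apple', 'carrot'},
]
n_orders = len(orders)

# Hypothetical helper: number of orders that contain every item given
def count_orders_with(*items):
    return sum(1 for order in orders if set(items) <= order)

n_a = count_orders_with('apple')           # 4
n_b = count_orders_with('egg')             # 3
n_ab = count_orders_with('apple', 'egg')   # 3

support_ab = n_ab / n_orders               # 3/5 = 0.6
confidence_a_to_b = n_ab / n_a             # support{A,B} / support{A} = 0.75
confidence_b_to_a = n_ab / n_b             # support{A,B} / support{B} = 1.0
lift = (n_ab * n_orders) / (n_a * n_b)     # support{A,B} / (support{A} * support{B}) = 1.25

print(support_ab, confidence_a_to_b, confidence_b_to_a, lift)
```

Note that the lift formula is written here with supports as fractions between 0 and 1; the n_orders factors cancel in the confidence formulas but not in lift.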

Armed with knowledge of apriori and association rules mining, let's dive into the data and code to see what relationships we unravel!

Input Dataset

Instacart, an online grocer, has graciously made some of their datasets accessible to the public. The order and product datasets that we will be using can be downloaded from the link below, along with the data dictionary:

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on September 1, 2017.

In [2]:
import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter
In [3]:
# Function that returns the size of an object in MB
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))

Part 1: Data Preparation

A. Load order data

In [4]:
orders = pd.read_csv('order_products__prior.csv')
print('orders -- dimensions: {0};   size: {1}'.format(orders.shape, size(orders)))
display(orders.head())
orders -- dimensions: (32434489, 4);   size: 1037.90 MB
   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0

B. Convert order data into format expected by the association rules function

In [5]:
# Convert from DataFrame to a Series, with order_id as index and item_id as value
orders = orders.set_index('order_id')['product_id'].rename('item_id')
display(orders.head(10))
type(orders)
order_id
2    33120
2    28985
2     9327
2    45918
2    30035
2    17794
2    40141
2     1819
2    43668
3    33754
Name: item_id, dtype: int64

pandas.core.series.Series

C. Display summary statistics for order data

In [6]:
print('orders -- dimensions: {0};   size: {1};   unique_orders: {2};   unique_items: {3}'
      .format(orders.shape, size(orders), len(orders.index.unique()), len(orders.value_counts())))
orders -- dimensions: (32434489,);   size: 518.95 MB;   unique_orders: 3214874;   unique_items: 49677

Part 2: Association Rules Function

A. Helper functions to the main association rules function

In [7]:
# Returns frequency counts for items and item pairs
def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else: 
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


# Returns generator that yields item pairs, one at a time
def get_item_pairs(order_item):
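    # .as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
    order_item = order_item.reset_index().values
    # itertools.groupby only groups consecutive rows, so rows must be contiguous by order_id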
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
              
        for item_pair in combinations(item_list, 2):
            yield item_pair
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]               
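
One property of get_item_pairs() is worth knowing before reading the results: combinations() keeps items in the sequence in which they appear within each order, so (A, B) and (B, A) are tallied as two separate keys whenever two orders list the same items in a different sequence. This is why the results later in this post contain both a "Banana / Bag of Organic Bananas" row and a "Bag of Organic Bananas / Banana" row. The sketch below contrasts that behaviour with a hypothetical variant, which is not part of the original code, that sorts each order's items first so that both sequences land on one key:

```python
from collections import Counter
from itertools import combinations, groupby

# Toy (order_id, item) rows, contiguous by order_id.
# Both orders contain the same two items, listed in a different sequence.
rows = [(1, 'egg'), (1, 'apple'), (2, 'apple'), (2, 'egg')]

def get_item_pairs_as_listed(rows):
    # Same logic as get_item_pairs(): pairs follow the sequence of the rows
    for _, group in groupby(rows, key=lambda row: row[0]):
        yield from combinations([item for _, item in group], 2)

def get_item_pairs_sorted(rows):
    # Hypothetical variant: sort items within each order before pairing
    for _, group in groupby(rows, key=lambda row: row[0]):
        yield from combinations(sorted(item for _, item in group), 2)

as_listed = Counter(get_item_pairs_as_listed(rows))
merged = Counter(get_item_pairs_sorted(rows))

print(as_listed)  # Counter({('egg', 'apple'): 1, ('apple', 'egg'): 1})
print(merged)     # Counter({('apple', 'egg'): 2})
```

Whether to merge the two directions is a design choice. With the sorted variant each pair appears once and its pair frequency is the total number of co-occurrences, which is the quantity that the support definition above refers to.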

B. Association rules function

In [8]:
def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
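    # Note: supports are stored as percentages, so this value is 1/100 of lift as defined
    # earlier (multiply by 100 to compare against the lift = 1 baseline); the ranking is unaffected
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])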
   

    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

Part 3: Association Rules Mining

In [9]:
%%time
rules = association_rules(orders, 0.01)  
Starting order_item:               32434489
Items with support >= 0.01:           10906
Remaining order_item:              29843570
Remaining orders with 2+ items:     3013325
Remaining order_item:              29662716
Item pairs:                        30622410
Item pairs with support >= 0.01:      48751

CPU times: user 9min 26s, sys: 34.5 s, total: 10min 1s
Wall time: 10min 13s
In [10]:
# Replace item ID with item name and display association rules
item_name   = pd.read_csv('products.csv')
item_name   = item_name.rename(columns={'product_id':'item_id', 'product_name':'item_name'})
rules_final = merge_item_name(rules, item_name).sort_values('lift', ascending=False)
display(rules_final)
itemA itemB freqAB supportAB freqA supportA freqB supportB confidenceAtoB confidenceBtoA lift
0 Organic Strawberry Chia Lowfat 2% Cottage Cheese Organic Cottage Cheese Blueberry Acai Chia 306 0.010155 1163 0.038595 839 0.027843 0.263113 0.364720 9.449868
1 Grain Free Chicken Formula Cat Food Grain Free Turkey Formula Cat Food 318 0.010553 1809 0.060033 879 0.029170 0.175788 0.361775 6.026229
3 Organic Fruit Yogurt Smoothie Mixed Berry Apple Blueberry Fruit Yogurt Smoothie 349 0.011582 1518 0.050376 1249 0.041449 0.229908 0.279424 5.546732
9 Nonfat Strawberry With Fruit On The Bottom Gre... 0% Greek, Blueberry on the Bottom Yogurt 409 0.013573 1666 0.055288 1391 0.046162 0.245498 0.294033 5.318230
10 Organic Grapefruit Ginger Sparkling Yerba Mate Cranberry Pomegranate Sparkling Yerba Mate 351 0.011648 1731 0.057445 1149 0.038131 0.202773 0.305483 5.317849
11 Baby Food Pouch - Roasted Carrot Spinach & Beans Baby Food Pouch - Butternut Squash, Carrot & C... 332 0.011018 1503 0.049878 1290 0.042810 0.220892 0.257364 5.159830
12 Unsweetened Whole Milk Mixed Berry Greek Yogurt Unsweetened Whole Milk Blueberry Greek Yogurt 438 0.014535 1622 0.053828 1621 0.053794 0.270037 0.270204 5.019798
23 Uncured Cracked Pepper Beef Chipotle Beef & Pork Realstick 410 0.013606 1839 0.061029 1370 0.045465 0.222947 0.299270 4.903741
24 Organic Mango Yogurt Organic Whole Milk Washington Black Cherry Yogurt 334 0.011084 1675 0.055586 1390 0.046128 0.199403 0.240288 4.322777
2 Grain Free Chicken Formula Cat Food Grain Free Turkey & Salmon Formula Cat Food 391 0.012976 1809 0.060033 1553 0.051538 0.216142 0.251771 4.193848
25 Raspberry Essence Water Unsweetened Pomegranate Essence Water 366 0.012146 2025 0.067202 1304 0.043274 0.180741 0.280675 4.176615
13 Unsweetened Whole Milk Strawberry Yogurt Unsweetened Whole Milk Blueberry Greek Yogurt 440 0.014602 1965 0.065210 1621 0.053794 0.223919 0.271437 4.162489
14 Unsweetened Whole Milk Peach Greek Yogurt Unsweetened Whole Milk Blueberry Greek Yogurt 421 0.013971 1922 0.063783 1621 0.053794 0.219043 0.259716 4.071849
44 Oh My Yog! Pacific Coast Strawberry Trilayer Y... Oh My Yog! Organic Wild Quebec Blueberry Cream... 860 0.028540 2857 0.094812 2271 0.075365 0.301015 0.378688 3.994083
55 Mighty 4 Kale, Strawberry, Amaranth & Greek Yo... Mighty 4 Essential Tots Spinach, Kiwi, Barley ... 390 0.012943 2206 0.073208 1337 0.044370 0.176791 0.291698 3.984498
20 Unsweetened Whole Milk Peach Greek Yogurt Unsweetened Whole Milk Strawberry Yogurt 499 0.016560 1922 0.063783 1965 0.065210 0.259625 0.253944 3.981352
65 0% Greek, Blueberry on the Bottom Yogurt Nonfat Strawberry With Fruit On The Bottom Gre... 305 0.010122 1391 0.046162 1666 0.055288 0.219267 0.183073 3.965918
15 Unsweetened Whole Milk Mixed Berry Greek Yogurt Unsweetened Whole Milk Peach Greek Yogurt 410 0.013606 1622 0.053828 1922 0.063783 0.252774 0.213319 3.963014
43 Unsweetened Whole Milk Peach Greek Yogurt Unsweetened Whole Milk Mixed Berry Greek Yogurt 407 0.013507 1922 0.063783 1622 0.053828 0.211759 0.250925 3.934016
26 Unsweetened Blackberry Water Unsweetened Pomegranate Essence Water 494 0.016394 3114 0.103341 1304 0.043274 0.158638 0.378834 3.665867
19 Unsweetened Whole Milk Mixed Berry Greek Yogurt Unsweetened Whole Milk Strawberry Yogurt 383 0.012710 1622 0.053828 1965 0.065210 0.236128 0.194911 3.621024
16 Unsweetened Whole Milk Strawberry Yogurt Unsweetened Whole Milk Peach Greek Yogurt 444 0.014735 1965 0.065210 1922 0.063783 0.225954 0.231009 3.542526
56 Mighty 4 Sweet Potato, Blueberry, Millet & Gre... Mighty 4 Essential Tots Spinach, Kiwi, Barley ... 398 0.013208 2534 0.084093 1337 0.044370 0.157064 0.297681 3.539900
74 Sweet Potatoes Stage 2 Organic Stage 2 Winter Squash Baby Food Puree 322 0.010686 2077 0.068927 1322 0.043872 0.155031 0.243570 3.533734
79 Compostable Forks Plastic Spoons 321 0.010653 1528 0.050708 1838 0.060996 0.210079 0.174646 3.444151
75 Organic Stage 2 Carrots Baby Food Organic Stage 2 Winter Squash Baby Food Puree 337 0.011184 2306 0.076527 1322 0.043872 0.146141 0.254917 3.331080
42 Unsweetened Whole Milk Strawberry Yogurt Unsweetened Whole Milk Mixed Berry Greek Yogurt 352 0.011681 1965 0.065210 1622 0.053828 0.179135 0.217016 3.327938
21 Unsweetened Whole Milk Blueberry Greek Yogurt Unsweetened Whole Milk Strawberry Yogurt 350 0.011615 1621 0.053794 1965 0.065210 0.215916 0.178117 3.311071
17 Unsweetened Whole Milk Blueberry Greek Yogurt Unsweetened Whole Milk Peach Greek Yogurt 341 0.011316 1621 0.053794 1922 0.063783 0.210364 0.177419 3.298101
83 Cream Top Blueberry Yogurt Cream Top Peach on the Bottom Yogurt 313 0.010387 1676 0.055620 1748 0.058009 0.186754 0.179062 3.219399
... ... ... ... ... ... ... ... ... ... ... ...
22444 Large Lemon Hass Avocados 468 0.015531 152177 5.050136 49246 1.634274 0.003075 0.009503 0.001882
2577 Red Onion Bag of Organic Bananas 1008 0.033451 42906 1.423876 376367 12.490090 0.023493 0.002678 0.001881
250 Roasted Pine Nut Hummus Banana 327 0.010852 11176 0.370886 470096 15.600574 0.029259 0.000696 0.001876
655 Organic Large Green Asparagus Banana 556 0.018451 19228 0.638099 470096 15.600574 0.028916 0.001183 0.001854
40897 Banana Organic Extra Virgin Olive Oil 369 0.012246 470096 15.600574 12788 0.424382 0.000785 0.028855 0.001850
2652 Spinach Bag of Organic Bananas 383 0.012710 16766 0.556395 376367 12.490090 0.022844 0.001018 0.001829
2722 Sour Cream Bag of Organic Bananas 486 0.016128 21481 0.712867 376367 12.490090 0.022625 0.001291 0.001811
11143 Organic Blueberries Blueberries 329 0.010918 99359 3.297321 55703 1.848556 0.003311 0.005906 0.001791
2537 Green Onions Bag of Organic Bananas 592 0.019646 26467 0.878332 376367 12.490090 0.022367 0.001573 0.001791
1386 2% Reduced Fat Milk Organic Strawberries 574 0.019049 36768 1.220180 263416 8.741706 0.015611 0.002179 0.001786
3291 2% Reduced Fat Milk Organic Baby Spinach 523 0.017356 36768 1.220180 240637 7.985763 0.014224 0.002173 0.001781
530 Chocolate Chip Cookies Banana 377 0.012511 13688 0.454249 470096 15.600574 0.027542 0.000802 0.001765
10681 Half & Half Organic Half & Half 302 0.010022 68842 2.284586 75334 2.500029 0.004387 0.004009 0.001755
5446 Organic Reduced Fat 2% Milk Organic Whole Milk 379 0.012577 47593 1.579418 136832 4.540898 0.007963 0.002770 0.001754
11455 Banana Soda 864 0.028673 470096 15.600574 33008 1.095401 0.001838 0.026175 0.001678
11421 Bag of Organic Bananas Fridge Pack Cola 366 0.012146 376367 12.490090 18005 0.597513 0.000972 0.020328 0.001628
2568 Asparation/Broccolini/Baby Broccoli Bag of Organic Bananas 317 0.010520 16480 0.546904 376367 12.490090 0.019235 0.000842 0.001540
19596 Banana Organic Tortilla Chips 320 0.010619 470096 15.600574 13458 0.446616 0.000681 0.023778 0.001524
2319 Fridge Pack Cola Bag of Organic Bananas 341 0.011316 18005 0.597513 376367 12.490090 0.018939 0.000906 0.001516
11017 Organic Baby Spinach 2% Reduced Fat Milk 403 0.013374 240637 7.985763 36768 1.220180 0.001675 0.010961 0.001373
22572 Organic Raspberries Raspberries 322 0.010686 136621 4.533895 56858 1.886886 0.002357 0.005663 0.001249
11012 Organic Strawberries 2% Reduced Fat Milk 371 0.012312 263416 8.741706 36768 1.220180 0.001408 0.010090 0.001154
246 Soda Banana 531 0.017622 33008 1.095401 470096 15.600574 0.016087 0.001130 0.001031
11555 Banana Clementines 397 0.013175 470096 15.600574 29798 0.988874 0.000845 0.013323 0.000854
1474 Strawberries Organic Strawberries 706 0.023429 141805 4.705931 263416 8.741706 0.004979 0.002680 0.000570
7271 Organic Strawberries Strawberries 640 0.021239 263416 8.741706 141805 4.705931 0.002430 0.004513 0.000516
6763 Organic Hass Avocado Organic Avocado 464 0.015398 212785 7.061469 176241 5.848722 0.002181 0.002633 0.000373
4387 Organic Avocado Organic Hass Avocado 443 0.014701 176241 5.848722 212785 7.061469 0.002514 0.002082 0.000356
2596 Banana Bag of Organic Bananas 654 0.021704 470096 15.600574 376367 12.490090 0.001391 0.001738 0.000111
670 Bag of Organic Bananas Banana 522 0.017323 376367 12.490090 470096 15.600574 0.001387 0.001110 0.000089

48751 rows × 11 columns
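
A note on reading the lift column: association_rules() stores support as a percentage (it multiplies by 100), while the lift formula in the metrics section assumes supports expressed as fractions between 0 and 1. The factor of 100 cancels in the confidence columns but not in lift. The sketch below recomputes the first row of the output above both ways, using only numbers printed in this post; the variable names are mine:

```python
# Values copied from the output above
n_orders = 3013325                     # "Remaining orders with 2+ items"
freq_ab, freq_a, freq_b = 306, 1163, 839   # first row of the rules table

# Supports as percentages, as stored by association_rules()
support_ab = freq_ab / n_orders * 100
support_a = freq_a / n_orders * 100
support_b = freq_b / n_orders * 100
lift_from_percent = support_ab / (support_a * support_b)

# The same ratio with supports as fractions between 0 and 1
lift_from_fraction = (freq_ab / n_orders) / ((freq_a / n_orders) * (freq_b / n_orders))

print(lift_from_percent)    # matches the lift column for this row
print(lift_from_fraction)   # 100 times larger
```

The first value is about 9.45, which matches the table, and the second is about 945. The ordering of the rules is the same under either scale, but the lift column as printed should be multiplied by 100 before comparing it against the lift = 1 baseline described earlier.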

Part 4: Conclusion

From the output above, we see that the top associations are not surprising, with one flavor of an item being purchased with another flavor from the same item family (eg: Strawberry Chia Cottage Cheese with Blueberry Acai Cottage Cheese, Chicken Cat Food with Turkey Cat Food, etc). As mentioned, one common application of association rules mining is in the domain of recommender systems. Once item pairs have been identified as having a positive relationship, recommendations can be made to customers in order to increase sales. And hopefully, along the way, also introduce customers to items they never would have tried before or even imagined existed!