Motivation

I was looking to run association analysis in Python using the apriori algorithm to derive rules of the form {A} -> {B}. However, I quickly discovered that it's not part of the standard Python machine learning libraries. Although some implementations exist, I could not find one capable of handling large datasets. "Large" in my case was an orders dataset with 32 million records, containing 3.2 million unique orders and about 50K unique items (file size just over 1 GB). So, I decided to write my own implementation, leveraging the apriori algorithm to generate simple {A} -> {B} association rules. Since I only care about understanding relationships between any given pair of items, using apriori to get to item sets of size 2 is sufficient. I went through various iterations, splitting the data into multiple subsets just so I could get functions like crosstab and combinations to run on my machine with 8 GB of memory. :) But even with this approach, I could only process about 1800 items before my kernel would crash... And that's when I learned about the wonderful world of Python generators.

Python Generators

In a nutshell, a generator is a special type of function that returns an iterable sequence of items. However, unlike a regular function, which returns all of its values at once (eg: returning all the elements of a list), a generator yields one value at a time. To get the next value in the sequence, we must ask for it, either explicitly by passing the generator to the built-in next() function, or implicitly via a for loop. This is a great property of generators because it means that we don't have to store all of the values in memory at once. We can load and process one value at a time, discard it when finished and move on to the next value. This feature makes generators perfect for creating item pairs and counting their frequency of co-occurrence. Here's a concrete example of what we're trying to accomplish:

Get all possible item pairs for a given order

eg:  order 1:  apple, egg, milk    -->  item pairs: {apple, egg}, {apple, milk}, {egg, milk}
     order 2:  egg, milk           -->  item pairs: {egg, milk}

Count the number of times each item pair appears

eg: {apple, egg}: 1
    {apple, milk}: 1
    {egg, milk}: 2

Here's the generator that implements the above tasks:

In [1]:

import numpy as np
from itertools import combinations, groupby
from collections import Counter

# Sample data
orders = np.array([[1,'apple'], [1,'egg'], [1,'milk'], [2,'egg'], [2,'milk']], dtype=object)

# Generator that yields item pairs, one at a time
def get_item_pairs(order_item):
    
    # For each order, generate a list of items in that order
    # (groupby only groups consecutive rows, so rows must already be ordered by order id)
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]      
    
        # For each item list, generate item pairs, one at a time
        for item_pair in combinations(item_list, 2):
            yield item_pair                                      


# Counter iterates through the item pairs returned by our generator and keeps a tally of their occurrence
Counter(get_item_pairs(orders))

Out[1]:

Counter({('apple', 'egg'): 1, ('apple', 'milk'): 1, ('egg', 'milk'): 2})

get_item_pairs() generates a list of items for each order and produces item pairs for that order, one pair at a time. The first item pair is passed to Counter which keeps track of the number of times an item pair occurs. The next item pair is taken, and again, passed to Counter. This process continues until there are no more item pairs left. With this approach, we end up not using much memory as item pairs are discarded after the count is updated.
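
To make the one-at-a-time behaviour concrete, here is a minimal sketch that steps through a pair generator by hand with next(). It uses a plain dictionary of the two sample orders instead of the NumPy array above, and the names `sample_orders` and `pair_generator` are introduced purely for illustration.

```python
from itertools import combinations

# Hypothetical stand-in for the sample data: order id -> items in that order
sample_orders = {1: ["apple", "egg", "milk"], 2: ["egg", "milk"]}


def pair_generator(orders_by_id):
    """Yield item pairs lazily, one order at a time."""
    for items in orders_by_id.values():
        for pair in combinations(items, 2):
            yield pair


gen = pair_generator(sample_orders)

# Nothing has been computed yet; each call to next() produces exactly one pair
first = next(gen)    # ('apple', 'egg')
second = next(gen)   # ('apple', 'milk')

# A for loop (or list()) consumes whatever is left
remaining = list(gen)
print(first, second, remaining)
```

Once the generator is exhausted, a further call to next() raises StopIteration, which is the signal a for loop uses to stop.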

Apriori Algorithm

Apriori is an algorithm used to identify frequent item sets (in our case, item pairs). It does so using a "bottom up" approach, first identifying individual items that satisfy a minimum occurrence threshold. It then extends the item set, adding one item at a time and checking if the resulting item set still satisfies the specified threshold. The algorithm stops when there are no more items to add that meet the minimum occurrence requirement. Here's an example of apriori in action, assuming a minimum occurrence threshold of 3:

order 1: apple, egg, milk  
order 2: carrot, milk  
order 3: apple, egg, carrot
order 4: apple, egg
order 5: apple, carrot


Iteration 1:  Count the number of times each item occurs   
item set      occurrence count    
{apple}              4   
{egg}                3   
{milk}               2   
{carrot}             3   

{milk} is eliminated because it does not meet the minimum occurrence threshold.


Iteration 2: Build item sets of size 2 using the remaining items from Iteration 1 (ie: apple, egg, carrot)  
item set           occurrence count  
{apple, egg}             3  
{apple, carrot}          2  
{egg, carrot}            1  

{apple, carrot} and {egg, carrot} are eliminated because they do not meet the minimum occurrence threshold. Only {apple, egg} remains and the algorithm stops, since a single remaining pair cannot be extended into an item set of size 3.

If we had more orders and items, we could continue to iterate, building item sets consisting of more than 2 elements. For the problem we are trying to solve (ie: finding relationships between pairs of items), it suffices to implement apriori to get to item sets of size 2.
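
The two iterations can be sketched in a few lines of plain Python. This is only an illustration on the five example orders, not the implementation used later in this post; the names `min_count`, `frequent_items` and `frequent_pairs` are assumptions made for the sketch.

```python
from collections import Counter
from itertools import combinations

orders = [
    ["apple", "egg", "milk"],
    ["carrot", "milk"],
    ["apple", "egg", "carrot"],
    ["apple", "egg"],
    ["apple", "carrot"],
]
min_count = 3  # minimum occurrence threshold from the example

# Iteration 1: count single items and keep those that meet the threshold
item_counts = Counter(item for order in orders for item in set(order))
frequent_items = {item for item, n in item_counts.items() if n >= min_count}

# Iteration 2: build pairs only from the surviving items, then filter again.
# Sorting gives every pair one canonical form, e.g. ('apple', 'egg').
pair_counts = Counter(
    pair
    for order in orders
    for pair in combinations(sorted(set(order) & frequent_items), 2)
)
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}

print(frequent_pairs)  # {('apple', 'egg'): 3}
```

Dropping infrequent items before building pairs is what keeps the second pass small: a pair can only be frequent if both of its items are.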

Association Rules Mining

Once the item sets have been generated using apriori, we can start mining association rules. Given that we are only looking at item sets of size 2, the association rules we will generate will be of the form {A} -> {B}. One common application of these rules is in the domain of recommender systems, where customers who purchased item A are recommended item B.

Here are 3 key metrics to consider when evaluating association rules:

support
This is the percentage of orders that contain the item set. In the example above, there are 5 orders in total and {apple,egg} occurs in 3 of them, so:
```
             support{apple,egg} = 3/5 or 60%
```
The minimum support threshold required by apriori can be set based on knowledge of your domain. In this grocery dataset for example, since there could be thousands of distinct items and an order can contain only a small fraction of these items, setting the support threshold to 0.01% may be reasonable.
confidence
Given two items, A and B, confidence measures the percentage of times that item B is purchased, given that item A was purchased. This is expressed as:
```
             confidence{A->B} = support{A,B} / support{A}
```
Confidence values range from 0 to 1, where 0 indicates that B is never purchased when A is purchased, and 1 indicates that B is always purchased whenever A is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that item A is purchased, given that item B was purchased:
```
             confidence{B->A} = support{A,B} / support{B}
```
In our example, the percentage of times that egg is purchased, given that apple was purchased is:
```
             confidence{apple->egg} = support{apple,egg} / support{apple}
                                    = (3/5) / (4/5)
                                    = 0.75 or 75%
```
A confidence value of 0.75 implies that out of all orders that contain apple, 75% of them also contain egg. Now, we look at the confidence measure in the opposite direction (ie: egg->apple):
```
             confidence{egg->apple} = support{apple,egg} / support{egg}
                                    = (3/5) / (3/5)
                                    = 1 or 100%  
```
Here we see that all of the orders that contain egg also contain apple. But, does this mean that there is a relationship between these two items, or are they occurring together in the same orders simply by chance? To answer this question, we look at another measure which takes into account the popularity of both items.
lift
Given two items, A and B, lift indicates whether there is a relationship between A and B, or whether the two items are occurring together in the same orders simply by chance (ie: at random). Unlike the confidence metric, whose value may vary depending on direction (eg: confidence{A->B} may be different from confidence{B->A}), lift has no direction. This means that lift{A,B} is always equal to lift{B,A}:
```
             lift{A,B} = lift{B,A} = support{A,B} / (support{A} * support{B})
```
In our example, we compute lift as follows:
```
             lift{apple,egg} = lift{egg,apple} = support{apple,egg} / (support{apple} * support{egg})
                                               = (3/5) / (4/5 * 3/5) 
                                               = 1.25
```
One way to understand lift is to think of the denominator as the likelihood that A and B will appear in the same order if there were no relationship between them. In the example above, apple occurs in 80% of the orders and egg occurs in 60% of the orders, so if there were no relationship between them, we would expect both of them to show up together in the same order 48% of the time (ie: 80% * 60%). The numerator, on the other hand, represents how often apple and egg actually appear together in the same order. In this example, that is 60% of the time. Dividing the numerator by the denominator tells us how many times more often apple and egg actually appear in the same order than they would if there were no relationship between them (ie: if they were occurring together simply at random).

In summary, lift can take on the following values:
```
 * lift = 1 implies no relationship between A and B. 
   (ie: A and B occur together only by chance)

 * lift > 1 implies that there is a positive relationship between A and B.
   (ie:  A and B occur together more often than random)

 * lift < 1 implies that there is a negative relationship between A and B.
   (ie:  A and B occur together less often than random)
```
In our example, apple and egg occur together 1.25 times as often as we would expect by chance, so we conclude that there exists a positive relationship between them.
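
As a check on the arithmetic, the three metrics for the apple and egg example can be reproduced with a short sketch. The `support` helper is a hypothetical name used only here, and exact fractions are used so the results match the hand calculations without floating-point rounding.

```python
from fractions import Fraction

orders = [
    {"apple", "egg", "milk"},
    {"carrot", "milk"},
    {"apple", "egg", "carrot"},
    {"apple", "egg"},
    {"apple", "carrot"},
]


def support(*items):
    """Fraction of orders that contain every item in `items`."""
    hits = sum(1 for order in orders if set(items) <= order)
    return Fraction(hits, len(orders))


support_ab = support("apple", "egg")                      # 3/5
confidence_a_to_b = support_ab / support("apple")         # 3/4
confidence_b_to_a = support_ab / support("egg")           # 1
lift = support_ab / (support("apple") * support("egg"))   # 5/4

print(float(confidence_a_to_b), float(confidence_b_to_a), float(lift))
# prints: 0.75 1.0 1.25
```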

Armed with knowledge of apriori and association rules mining, let's dive into the data and code to see what relationships we unravel!

Input Dataset

Instacart, an online grocer, has graciously made some of their datasets accessible to the public. The order and product datasets that we will be using can be downloaded from the link below, along with the data dictionary:

“The Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on September 1, 2017.

In [2]:

import pandas as pd
import numpy as np
import sys
from itertools import combinations, groupby
from collections import Counter

In [3]:

# Function that returns the size of an object in MB
def size(obj):
    return "{0:.2f} MB".format(sys.getsizeof(obj) / (1000 * 1000))

Part 1: Data Preparation

A. Load order data

In [4]:

orders = pd.read_csv('order_products__prior.csv')
print('orders -- dimensions: {0};   size: {1}'.format(orders.shape, size(orders)))
display(orders.head())

orders -- dimensions: (32434489, 4);   size: 1037.90 MB

	order_id	product_id	add_to_cart_order	reordered
0	2	33120	1	1
1	2	28985	2	1
2	2	9327	3	0
3	2	45918	4	1
4	2	30035	5	0

B. Convert order data into format expected by the association rules function

In [5]:

# Convert from DataFrame to a Series, with order_id as index and item_id as value
orders = orders.set_index('order_id')['product_id'].rename('item_id')
display(orders.head(10))
type(orders)

order_id
2    33120
2    28985
2     9327
2    45918
2    30035
2    17794
2    40141
2     1819
2    43668
3    33754
Name: item_id, dtype: int64

pandas.core.series.Series

C. Display summary statistics for order data

In [6]:

print('orders -- dimensions: {0};   size: {1};   unique_orders: {2};   unique_items: {3}'
      .format(orders.shape, size(orders), len(orders.index.unique()), len(orders.value_counts())))

orders -- dimensions: (32434489,);   size: 518.95 MB;   unique_orders: 3214874;   unique_items: 49677
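
A quick back-of-the-envelope calculation, using only the counts printed above, gives a sense of why materialising every possible item pair at once is impractical on a machine with limited memory. The variable names are illustrative, and math.comb assumes Python 3.8 or later.

```python
from math import comb

n_rows = 32434489    # order-item records
n_orders = 3214874   # unique orders
n_items = 49677      # unique items

# Average basket size
avg_items_per_order = n_rows / n_orders

# Number of distinct unordered pairs that could be formed from all items
possible_pairs = comb(n_items, 2)

print(f"{avg_items_per_order:.1f} items per order on average")
print(f"{possible_pairs:,} possible item pairs")
```

With roughly ten items per order and more than a billion candidate pairs, counting only the pairs that actually occur, one at a time, is far cheaper than building a full item-by-item table.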

Part 2: Association Rules Function

A. Helper functions to the main association rules function

In [7]:

# Returns frequency counts for items and item pairs
def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else: 
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


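# Returns generator that yields item pairs, one at a time.
# Pairs follow the row order within each order, so (A, B) and (B, A) are tallied as separate pairs.
def get_item_pairs(order_item):
    # as_matrix() was removed in pandas 1.0; to_numpy() returns the same 2-D array of (order_id, item_id) rows
    order_item = order_item.reset_index().to_numpy()
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]

        for item_pair in combinations(item_list, 2):
            yield item_pair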
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]
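
One detail of get_item_pairs is worth spelling out: itertools.groupby only groups consecutive rows that share a key, so the function relies on all rows of an order being adjacent (as the head of the data above suggests they are). The sketch below uses made-up rows and a hypothetical helper, `pairs_per_order`, to show what happens when they are not adjacent, and how sorting by order id first avoids the problem.

```python
from itertools import combinations, groupby


def pairs_per_order(rows):
    """Yield item pairs using the same groupby pattern as get_item_pairs."""
    for _, group in groupby(rows, key=lambda row: row[0]):
        items = [row[1] for row in group]
        yield from combinations(items, 2)


# Made-up rows in which the rows of order 1 are interleaved with those of order 2
scattered = [(1, "apple"), (2, "egg"), (1, "milk"), (2, "milk")]

# Every run of equal ids contains a single row, so no pairs are produced
pairs_unsorted = list(pairs_per_order(scattered))

# Sorting by order id first makes each order contiguous (sorted() is stable)
pairs_sorted = list(pairs_per_order(sorted(scattered, key=lambda row: row[0])))

print(pairs_unsorted)  # []
print(pairs_sorted)    # [('apple', 'milk'), ('egg', 'milk')]
```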

B. Association rules function

In [8]:

def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
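    # supportAB, supportA and supportB are percentages. The factor of 100 cancels in the
    # confidence ratios, but not in lift: the value computed here is 1/100 of the lift
    # ratio defined earlier, so multiply it by 100 before comparing against lift = 1.
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB'])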
   

    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

Part 3: Association Rules Mining

In [9]:

%%time
rules = association_rules(orders, 0.01)

Starting order_item:               32434489
Items with support >= 0.01:           10906
Remaining order_item:              29843570
Remaining orders with 2+ items:     3013325
Remaining order_item:              29662716
Item pairs:                        30622410
Item pairs with support >= 0.01:      48751

CPU times: user 9min 26s, sys: 34.5 s, total: 10min 1s
Wall time: 10min 13s
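
Because min_support is expressed as a percentage of the remaining orders, it can help to translate the 0.01 passed above into an absolute count. The short calculation below uses the number of qualifying orders from the printed output; the variable names are illustrative.

```python
import math

qualifying_orders = 3013325   # "Remaining orders with 2+ items" from the output above
min_support_pct = 0.01        # threshold passed to association_rules, in percent

# Smallest pair frequency whose support reaches the threshold
min_pair_count = math.ceil(min_support_pct / 100 * qualifying_orders)

print(min_pair_count)                             # 302
print(min_pair_count / qualifying_orders * 100)   # support of such a pair, in percent
```

In other words, an item pair has to appear in at least 302 orders to be kept as a rule.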

In [10]:

# Replace item ID with item name and display association rules
item_name   = pd.read_csv('products.csv')
item_name   = item_name.rename(columns={'product_id':'item_id', 'product_name':'item_name'})
rules_final = merge_item_name(rules, item_name).sort_values('lift', ascending=False)
display(rules_final)

	itemA	itemB	freqAB	supportAB	freqA	supportA	freqB	supportB	confidenceAtoB	confidenceBtoA	lift
0	Organic Strawberry Chia Lowfat 2% Cottage Cheese	Organic Cottage Cheese Blueberry Acai Chia	306	0.010155	1163	0.038595	839	0.027843	0.263113	0.364720	9.449868
1	Grain Free Chicken Formula Cat Food	Grain Free Turkey Formula Cat Food	318	0.010553	1809	0.060033	879	0.029170	0.175788	0.361775	6.026229
3	Organic Fruit Yogurt Smoothie Mixed Berry	Apple Blueberry Fruit Yogurt Smoothie	349	0.011582	1518	0.050376	1249	0.041449	0.229908	0.279424	5.546732
9	Nonfat Strawberry With Fruit On The Bottom Gre...	0% Greek, Blueberry on the Bottom Yogurt	409	0.013573	1666	0.055288	1391	0.046162	0.245498	0.294033	5.318230
10	Organic Grapefruit Ginger Sparkling Yerba Mate	Cranberry Pomegranate Sparkling Yerba Mate	351	0.011648	1731	0.057445	1149	0.038131	0.202773	0.305483	5.317849
11	Baby Food Pouch - Roasted Carrot Spinach & Beans	Baby Food Pouch - Butternut Squash, Carrot & C...	332	0.011018	1503	0.049878	1290	0.042810	0.220892	0.257364	5.159830
12	Unsweetened Whole Milk Mixed Berry Greek Yogurt	Unsweetened Whole Milk Blueberry Greek Yogurt	438	0.014535	1622	0.053828	1621	0.053794	0.270037	0.270204	5.019798
23	Uncured Cracked Pepper Beef	Chipotle Beef & Pork Realstick	410	0.013606	1839	0.061029	1370	0.045465	0.222947	0.299270	4.903741
24	Organic Mango Yogurt	Organic Whole Milk Washington Black Cherry Yogurt	334	0.011084	1675	0.055586	1390	0.046128	0.199403	0.240288	4.322777
2	Grain Free Chicken Formula Cat Food	Grain Free Turkey & Salmon Formula Cat Food	391	0.012976	1809	0.060033	1553	0.051538	0.216142	0.251771	4.193848
25	Raspberry Essence Water	Unsweetened Pomegranate Essence Water	366	0.012146	2025	0.067202	1304	0.043274	0.180741	0.280675	4.176615
13	Unsweetened Whole Milk Strawberry Yogurt	Unsweetened Whole Milk Blueberry Greek Yogurt	440	0.014602	1965	0.065210	1621	0.053794	0.223919	0.271437	4.162489
14	Unsweetened Whole Milk Peach Greek Yogurt	Unsweetened Whole Milk Blueberry Greek Yogurt	421	0.013971	1922	0.063783	1621	0.053794	0.219043	0.259716	4.071849
44	Oh My Yog! Pacific Coast Strawberry Trilayer Y...	Oh My Yog! Organic Wild Quebec Blueberry Cream...	860	0.028540	2857	0.094812	2271	0.075365	0.301015	0.378688	3.994083
55	Mighty 4 Kale, Strawberry, Amaranth & Greek Yo...	Mighty 4 Essential Tots Spinach, Kiwi, Barley ...	390	0.012943	2206	0.073208	1337	0.044370	0.176791	0.291698	3.984498
20	Unsweetened Whole Milk Peach Greek Yogurt	Unsweetened Whole Milk Strawberry Yogurt	499	0.016560	1922	0.063783	1965	0.065210	0.259625	0.253944	3.981352
65	0% Greek, Blueberry on the Bottom Yogurt	Nonfat Strawberry With Fruit On The Bottom Gre...	305	0.010122	1391	0.046162	1666	0.055288	0.219267	0.183073	3.965918
15	Unsweetened Whole Milk Mixed Berry Greek Yogurt	Unsweetened Whole Milk Peach Greek Yogurt	410	0.013606	1622	0.053828	1922	0.063783	0.252774	0.213319	3.963014
43	Unsweetened Whole Milk Peach Greek Yogurt	Unsweetened Whole Milk Mixed Berry Greek Yogurt	407	0.013507	1922	0.063783	1622	0.053828	0.211759	0.250925	3.934016
26	Unsweetened Blackberry Water	Unsweetened Pomegranate Essence Water	494	0.016394	3114	0.103341	1304	0.043274	0.158638	0.378834	3.665867
19	Unsweetened Whole Milk Mixed Berry Greek Yogurt	Unsweetened Whole Milk Strawberry Yogurt	383	0.012710	1622	0.053828	1965	0.065210	0.236128	0.194911	3.621024
16	Unsweetened Whole Milk Strawberry Yogurt	Unsweetened Whole Milk Peach Greek Yogurt	444	0.014735	1965	0.065210	1922	0.063783	0.225954	0.231009	3.542526
56	Mighty 4 Sweet Potato, Blueberry, Millet & Gre...	Mighty 4 Essential Tots Spinach, Kiwi, Barley ...	398	0.013208	2534	0.084093	1337	0.044370	0.157064	0.297681	3.539900
74	Sweet Potatoes Stage 2	Organic Stage 2 Winter Squash Baby Food Puree	322	0.010686	2077	0.068927	1322	0.043872	0.155031	0.243570	3.533734
79	Compostable Forks	Plastic Spoons	321	0.010653	1528	0.050708	1838	0.060996	0.210079	0.174646	3.444151
75	Organic Stage 2 Carrots Baby Food	Organic Stage 2 Winter Squash Baby Food Puree	337	0.011184	2306	0.076527	1322	0.043872	0.146141	0.254917	3.331080
42	Unsweetened Whole Milk Strawberry Yogurt	Unsweetened Whole Milk Mixed Berry Greek Yogurt	352	0.011681	1965	0.065210	1622	0.053828	0.179135	0.217016	3.327938
21	Unsweetened Whole Milk Blueberry Greek Yogurt	Unsweetened Whole Milk Strawberry Yogurt	350	0.011615	1621	0.053794	1965	0.065210	0.215916	0.178117	3.311071
17	Unsweetened Whole Milk Blueberry Greek Yogurt	Unsweetened Whole Milk Peach Greek Yogurt	341	0.011316	1621	0.053794	1922	0.063783	0.210364	0.177419	3.298101
83	Cream Top Blueberry Yogurt	Cream Top Peach on the Bottom Yogurt	313	0.010387	1676	0.055620	1748	0.058009	0.186754	0.179062	3.219399
...	...	...	...	...	...	...	...	...	...	...	...
22444	Large Lemon	Hass Avocados	468	0.015531	152177	5.050136	49246	1.634274	0.003075	0.009503	0.001882
2577	Red Onion	Bag of Organic Bananas	1008	0.033451	42906	1.423876	376367	12.490090	0.023493	0.002678	0.001881
250	Roasted Pine Nut Hummus	Banana	327	0.010852	11176	0.370886	470096	15.600574	0.029259	0.000696	0.001876
655	Organic Large Green Asparagus	Banana	556	0.018451	19228	0.638099	470096	15.600574	0.028916	0.001183	0.001854
40897	Banana	Organic Extra Virgin Olive Oil	369	0.012246	470096	15.600574	12788	0.424382	0.000785	0.028855	0.001850
2652	Spinach	Bag of Organic Bananas	383	0.012710	16766	0.556395	376367	12.490090	0.022844	0.001018	0.001829
2722	Sour Cream	Bag of Organic Bananas	486	0.016128	21481	0.712867	376367	12.490090	0.022625	0.001291	0.001811
11143	Organic Blueberries	Blueberries	329	0.010918	99359	3.297321	55703	1.848556	0.003311	0.005906	0.001791
2537	Green Onions	Bag of Organic Bananas	592	0.019646	26467	0.878332	376367	12.490090	0.022367	0.001573	0.001791
1386	2% Reduced Fat Milk	Organic Strawberries	574	0.019049	36768	1.220180	263416	8.741706	0.015611	0.002179	0.001786
3291	2% Reduced Fat Milk	Organic Baby Spinach	523	0.017356	36768	1.220180	240637	7.985763	0.014224	0.002173	0.001781
530	Chocolate Chip Cookies	Banana	377	0.012511	13688	0.454249	470096	15.600574	0.027542	0.000802	0.001765
10681	Half & Half	Organic Half & Half	302	0.010022	68842	2.284586	75334	2.500029	0.004387	0.004009	0.001755
5446	Organic Reduced Fat 2% Milk	Organic Whole Milk	379	0.012577	47593	1.579418	136832	4.540898	0.007963	0.002770	0.001754
11455	Banana	Soda	864	0.028673	470096	15.600574	33008	1.095401	0.001838	0.026175	0.001678
11421	Bag of Organic Bananas	Fridge Pack Cola	366	0.012146	376367	12.490090	18005	0.597513	0.000972	0.020328	0.001628
2568	Asparation/Broccolini/Baby Broccoli	Bag of Organic Bananas	317	0.010520	16480	0.546904	376367	12.490090	0.019235	0.000842	0.001540
19596	Banana	Organic Tortilla Chips	320	0.010619	470096	15.600574	13458	0.446616	0.000681	0.023778	0.001524
2319	Fridge Pack Cola	Bag of Organic Bananas	341	0.011316	18005	0.597513	376367	12.490090	0.018939	0.000906	0.001516
11017	Organic Baby Spinach	2% Reduced Fat Milk	403	0.013374	240637	7.985763	36768	1.220180	0.001675	0.010961	0.001373
22572	Organic Raspberries	Raspberries	322	0.010686	136621	4.533895	56858	1.886886	0.002357	0.005663	0.001249
11012	Organic Strawberries	2% Reduced Fat Milk	371	0.012312	263416	8.741706	36768	1.220180	0.001408	0.010090	0.001154
246	Soda	Banana	531	0.017622	33008	1.095401	470096	15.600574	0.016087	0.001130	0.001031
11555	Banana	Clementines	397	0.013175	470096	15.600574	29798	0.988874	0.000845	0.013323	0.000854
1474	Strawberries	Organic Strawberries	706	0.023429	141805	4.705931	263416	8.741706	0.004979	0.002680	0.000570
7271	Organic Strawberries	Strawberries	640	0.021239	263416	8.741706	141805	4.705931	0.002430	0.004513	0.000516
6763	Organic Hass Avocado	Organic Avocado	464	0.015398	212785	7.061469	176241	5.848722	0.002181	0.002633	0.000373
4387	Organic Avocado	Organic Hass Avocado	443	0.014701	176241	5.848722	212785	7.061469	0.002514	0.002082	0.000356
2596	Banana	Bag of Organic Bananas	654	0.021704	470096	15.600574	376367	12.490090	0.001391	0.001738	0.000111
670	Bag of Organic Bananas	Banana	522	0.017323	376367	12.490090	470096	15.600574	0.001387	0.001110	0.000089

48751 rows × 11 columns

Part 4: Conclusion

From the output above, we see that the top associations are not surprising, with one flavor of an item being purchased with another flavor from the same item family (eg: Strawberry Chia Cottage Cheese with Blueberry Acai Cottage Cheese, Chicken Cat Food with Turkey Cat Food, etc). Keep in mind that the supports in the table are percentages, so the lift column is 1/100 of the lift ratio defined earlier: the top pair's 9.45 corresponds to a lift of about 945, while the bottom pair's 0.000089 corresponds to about 0.0089, which is still well below 1. As mentioned, one common application of association rules mining is in the domain of recommender systems. Once item pairs have been identified as having a positive relationship, recommendations can be made to customers in order to increase sales. And hopefully, along the way, also introduce customers to items they never would have tried before or even imagined existed!
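
As a rough illustration of the recommender use case, the sketch below looks up the best partner items for a given product from a handful of rules. The three (item A, item B, lift) rows are copied from the table above; the `recommend` function and its `top_n` parameter are hypothetical names introduced for this example only.

```python
# (item_A, item_B, lift) copied from three rows of the rules table above
rules = [
    ("Organic Strawberry Chia Lowfat 2% Cottage Cheese",
     "Organic Cottage Cheese Blueberry Acai Chia", 9.449868),
    ("Grain Free Chicken Formula Cat Food",
     "Grain Free Turkey Formula Cat Food", 6.026229),
    ("Grain Free Chicken Formula Cat Food",
     "Grain Free Turkey & Salmon Formula Cat Food", 4.193848),
]


def recommend(item, rules, top_n=1):
    """Return up to top_n partner items for `item`, highest lift first."""
    partners = []
    for item_a, item_b, lift in rules:
        if item_a == item:
            partners.append((lift, item_b))
        elif item_b == item:
            partners.append((lift, item_a))
    partners.sort(reverse=True)
    return [name for _, name in partners[:top_n]]


suggestions = recommend("Grain Free Chicken Formula Cat Food", rules, top_n=2)
print(suggestions)
```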

datathèque

Association Rules Mining Using Python Generators to Handle Large Datasets