數據挖掘演算法：啤酒與尿布的SQL簡單實現

2018-02-13
加入好友隨時分享有用經驗！

啤酒與尿布的演算法屬於電商領域中的購物籃分析，是用於挖掘不同商品之間的關聯規則，進而進行商品推薦的常用演算法。

以下為該演算法的SQL簡單實現。

1、源數據

select * from wx_tmp1;

數據格式如下

第一列為訂單id，第二列為商品id，第三列為商品名稱（真實數據，故擦掉），每個商品一條記錄。

2、統計每兩種商品同時被購買的次數

create table wx_tmp2 as
select a.skuid skuid_a,b.skuid skuid_b,count(distinct a.orderid) buy_cnt_together
from wx_tmp1 a join wx_tmp1 b on a.orderid=b.orderid

Advertisements

where a.skuid<>b.skuid
group by 1,2;

buy_cnt_together為兩種商品同時購買的訂單量

3、計算置信度、支持度及提升度

create table wx_tmp3 as
select a.product_a,a.product_b,a.buy_cnt_together,b.buy_cnt buy_cnt_a,c.buy_cnt buy_cnt_b,d.order_cnt,
a.buy_cnt_together/d.order_cnt support, --支持度
a.buy_cnt_together/b.buy_cnt confidence, --置信度

Advertisements

a.buy_cnt_together/(b.buy_cnt*c.buy_cnt/d.order_cnt) lift, --提升度
(a.buy_cnt_together/b.buy_cnt+a.buy_cnt_together/c.buy_cnt)/2 kulc, --KULC 度量
a.buy_cnt_together/b.buy_cnt/(a.buy_cnt_together/c.buy_cnt) ir --不平衡比
from wx_tmp2 a join
(
select skuid,count(distinct orderid) buy_cnt
from wx_tmp1
group by 1
) b on a.product_a=b.skuid join
(
select skuid,count(distinct orderid) buy_cnt
from wx_tmp1
group by 1
) c on a.product_b=c.skuid join
(
select count(distinct orderid) order_cnt
from wx_tmp1
) d on 1=1;

結果如下

每一行即商品A和商品B的關聯規則。

第一列為商品A，第二列為商品B，第三列為商品A與商品B同時購買的訂單量，第四列為購買商品A的訂單量，第五列為購買商品B的訂單量，第六列為總的訂單量，

第七列為支持度，即buy_cnt_together/order_cnt

第八列為置信度，即buy_cnt_together/buy_cnt_a

第九列為提升度，即buy_cnt_together/buy_cnt_a/(buy_cnt_b/order_cnt)，可以用來評估商品A對商品B的提升程度。

如果提升度等於 1，說明兩個商品沒有任何關聯。如果小於 1，說明兩個商品是負相關的關係，意味著商品A的銷量越高，可能商品B的銷量越低。大於 1 才表示具有正相關的關係。一般當提升度大於 3 時，我們才認為挖掘出的關聯規則是有價值的。

提升度是一種判斷關聯規則是否有效的簡單手段，實際中受零事務（也即不包含商品A也不包含商品B的訂單）的影響比較大。所以如果數據中含有的零事務數量較大，該度量則不合適使用。

第十列為KULC 度量，(buy_cnt_together/buy_cnt_a+buy_cnt_together/)/2，可以理解為兩個置信度的平均值，取值0 ~ 1，值越大，聯繫越大，關係越大，是常用的判斷關聯規則是否有效的度量。同時該度量不受零事務數量影響。

第十一列為IR（不平衡比），即buy_cnt_b/buy_cnt_a，如果IR值太大，說明兩個商品的關聯關係非常不平衡，購買商品A的顧客很可能同時會買商品B，而購買了商品B的用戶不太會再去買商品A。

Advertisements

Advertisements

Advertisements

你可能會喜歡

Advertisements