Data scientists are often compared to unicorns: rare, mythical creatures with great powers. While one can find people with machine learning skills, it’s extremely rare to find one who is also a domain expert with a deep understanding of the business. The business landscape is changing fast today. One example is retail, an industry that is seeing tremendous disruption due to the shift from brick-and-mortar to online buying. Machine learning is heavily used in this industry to augment the customer buying experience with individualized, just-in-time cross-sell and up-sell offers. Let’s walk through a retail use case. We’ll start by identifying the business problem and then we’ll use data science to find a solution.

We imagined a fictitious outdoor equipment company called Great Outdoors that sells camping, mountaineering, golf, personal accessories and outdoor protection product lines. There’s only one product line that’s seeing declining sales year-over-year for the past few years and that’s outdoor protection, which includes products such as insect repellents and sunscreens (Figure 1).

Figure 1: Steady decline in volume of outdoor protection sales

The company reaches out to the data scientist to investigate the issue. After a first look at the data, the first question the data scientist asks is whether this declining sales trend can be correlated with weather patterns, for example fewer warm days in a year reducing outdoor activity. He builds a quick exploration notebook in Python and runs it on the Spark framework. He uses a single line of the open-source Brunel visualization language, which is interactive, powerful and easy to code:

%brunel data('pd_weather') stack bar x(year) y(total_days) transpose color(category:[blue,teal,red,yellow]) percent(total_days) label(total_days) axes(x) tooltip(#all) :: height=100, width=600

From that, he finds that the percentage of warm and hot days hasn’t changed enough to explain the declining sales (Figure 2).

Figure 2: Similar number of hot and warm days for two consecutive years
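The percentage comparison behind Figure 2 can be sketched in plain Python. This is only an illustration: the column names year, category and total_days come from the Brunel command above, but the years, category labels and day counts below are invented and are not the actual pd_weather data.

```python
from collections import defaultdict

# Hypothetical rows shaped like the pd_weather data: (year, category, total_days).
# The years, categories and counts are invented for illustration only.
weather_rows = [
    (2015, "cold", 90), (2015, "mild", 110), (2015, "warm", 100), (2015, "hot", 65),
    (2016, "cold", 92), (2016, "mild", 109), (2016, "warm", 101), (2016, "hot", 64),
]

def category_shares(rows):
    """Return {year: {category: percent of that year's days}}."""
    totals = defaultdict(int)
    for year, _, days in rows:
        totals[year] += days
    shares = defaultdict(dict)
    for year, category, days in rows:
        shares[year][category] = 100.0 * days / totals[year]
    return {year: dict(cats) for year, cats in shares.items()}

shares = category_shares(weather_rows)
for year in sorted(shares):
    warm_or_hot = shares[year]["warm"] + shares[year]["hot"]
    print(f"{year}: warm or hot days = {warm_or_hot:.1f}%")
```

If the warm-or-hot share is nearly the same from one year to the next, as it is in this toy data, weather is unlikely to explain a steady sales decline.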

He next wonders whether reduced foot traffic in the zone of the retail store that carries the poor-selling items could be a factor. He explores foot traffic, which is tracked through Bluetooth beacon data captured from customers’ smartphones during their journey through a retail store. By exploring the foot traffic through each zone and, more importantly, the time spent there, one can gauge customers’ interest in browsing and eventually buying products from a given zone. He notices that the top-selling product line (camping equipment) gets foot traffic similar to that of the poor-selling outdoor protection line, but customers spend much less time in the poor-selling zone, as expected (Figure 3).

Figure 3: Similar visits to Camping Equipment and Outdoor Protection, but far less time spent in Outdoor Protection
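The visits-versus-time comparison can be illustrated with a small aggregation. The event layout (customer, zone, seconds in zone) and all values below are assumptions made for this sketch; the article does not describe the actual beacon data schema.

```python
from collections import defaultdict

# Hypothetical beacon events: (customer_id, zone, seconds_in_zone).
# The real beacon feed is not described in the article; these values are invented.
beacon_events = [
    ("c1", "Camping Equipment", 420),
    ("c1", "Outdoor Protection", 45),
    ("c2", "Camping Equipment", 380),
    ("c2", "Outdoor Protection", 60),
    ("c3", "Camping Equipment", 400),
    ("c3", "Outdoor Protection", 30),
]

def zone_summary(events):
    """Return {zone: (visit_count, average_seconds_per_visit)}."""
    visits = defaultdict(int)
    seconds = defaultdict(int)
    for _, zone, duration in events:
        visits[zone] += 1
        seconds[zone] += duration
    return {zone: (visits[zone], seconds[zone] / visits[zone]) for zone in visits}

summary = zone_summary(beacon_events)
for zone, (count, avg_seconds) in summary.items():
    print(f"{zone}: {count} visits, {avg_seconds:.0f} s average dwell time")
```

Separating visit counts from dwell time is the point of the sketch: two zones can receive the same number of visits while holding customers’ attention for very different lengths of time.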

Finding that customers spend the least amount of time within the outdoor protection zone corroborates the poor sales volume for that product line. However, what remains a mystery in the data scientist’s mind is what causes customers to avoid this product line. He then remembers that Great Outdoors had collected customer feedback on its product lines through online surveys. He explores those survey results and finds that across the key brand characteristics, the outdoor protection product line receives the lowest score, clearly highlighting customer dissatisfaction with the products that the retailer carries in that line (Figure 4).

Figure 4: Across the board, outdoor protection receives the lowest survey score

Seeing the low survey scores for outdoor protection, the data scientist decides to pull some Twitter data from customers tweeting about outdoor protection use during camping. He builds a Scala notebook to apply the LDA (Latent Dirichlet Allocation) algorithm to create ten topic clusters from the tweets. As expected, he finds topics about regular camping items, such as tent shape (dome) and characteristics (waterproof), but a couple of topics also pop up about allergies related to the chemical DEET in insect repellents and brain/nerve damage caused by mosquito bites (Figure 5). The concern about insect repellents expressed on Twitter explains the low scores on the survey for the outdoor protection category.

Figure 5: Topic cluster discussing allergic reactions to DEET and brain damage caused by insects

As is common with poor-selling products, the retailer considers putting the product on sale to deplete the inventory. The data scientist, wearing a business analyst hat, takes a quick look at the gross margin across the product lines. He notices that the poorest-selling line, outdoor protection, has the highest gross margin at nearly 60%, whereas the rest of the lines are in the 40% range (Figure 6). This leaves the retailer headroom to provide 10–20% discounts on items in the outdoor protection category and still maintain a gross margin for that product line above the rest.

Figure 6: Poor selling category, outdoor protection, has the highest gross margin
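The headroom claim can be checked with a little arithmetic. The sketch below assumes gross margin is defined as (price − cost) / price, that a discount lowers the price while the unit cost stays fixed, and that the "near 60%" margin is exactly 60%; these are simplifying assumptions, not figures from the retailer’s books.

```python
def margin_after_discount(gross_margin, discount):
    """Gross margin after a price discount, assuming unit cost stays fixed.

    With the list price normalised to 1, cost = 1 - gross_margin and the
    discounted price is 1 - discount.
    """
    cost = 1.0 - gross_margin
    discounted_price = 1.0 - discount
    return (discounted_price - cost) / discounted_price

# Assumed starting margin of 60% for outdoor protection (the text says "near 60%").
for discount in (0.10, 0.20):
    new_margin = margin_after_discount(0.60, discount)
    print(f"{discount:.0%} discount -> gross margin {new_margin:.1%}")
```

Under these assumptions, a 10% discount leaves a gross margin of about 55.6% and a 20% discount leaves 50%, both still above the 40% range of the other product lines.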

In preparation for the sales promotions, the data scientist takes a look at customer buying preferences and notices that there’s a difference between what men and women buy. The top-selling item for women is Eyewear, followed by Sleeping Bags and Binoculars, whereas men buy Tents, Sleeping Bags and Backpacks. The top-selling item for men (Tents) is the 6th best-selling item for women, whereas the top-selling item for women (Eyewear) is the 11th best-selling item for men (Figure 7).

Figure 7: Top selling products across men and women

Seeing the impact of demographics on customer buying behavior, the data scientist builds a decision-tree-based multi-class classifier for the top 4 product lines using the C5.0 algorithm in an R notebook. The key insights from the decision tree, shown in the Brunel chart (Figure 8), are that men older than 44 mostly buy golfing equipment (red category), single professional women younger than 44 buy personal accessories (green), single men with a Sales career who are younger than 26 buy mountaineering equipment (yellow) and married men between 29 and 44 years old buy camping equipment (blue).

Figure 8: C5.0 Decision tree based multi-class classifier for the product line

The data scientist then explores an opportunity to run a cross-sell campaign across product lines using market basket analysis. Using the Apriori algorithm in an R notebook, he finds rules based on product lines that occur together in a market basket. He draws a chord chart (Figure 9) using Brunel to visualize the association rules with a single element (product line) on each side of the rule. The color of the chord (the arc between two product lines shown on the circumference of the circle) shows the confidence of the rule and the width shows the support. The rule that stands out is the one between outdoor protection and camping equipment at the bottom of the circle, with a fat, red chord indicating relatively high support (0.2) and confidence (0.6). This rule implies that 20% of all transactions contain both outdoor protection and camping equipment items, and that 60% of the transactions containing outdoor protection items also contain camping equipment. This suggests that these two product lines could be combined into a cross-sell campaign to achieve the best outcome. For example, the retailer could choose to target tent buyers with discount offers on insect repellents.

Figure 9: Visualizing market basket analysis across product lines
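To make support and confidence concrete, here is a minimal sketch that computes both for a single-item rule. It is not the Apriori implementation used in the R notebook, and the baskets are invented for illustration. It follows the usual convention in which a rule’s support is the share of all baskets containing both sides of the rule, and its confidence is the share of baskets with the left-hand side that also contain the right-hand side.

```python
# Hypothetical baskets, each a set of product lines in one transaction.
baskets = [
    {"Camping Equipment", "Outdoor Protection"},
    {"Camping Equipment", "Outdoor Protection"},
    {"Camping Equipment", "Golf Equipment"},
    {"Outdoor Protection"},
    {"Personal Accessories"},
    {"Camping Equipment"},
    {"Mountaineering Equipment", "Camping Equipment"},
    {"Golf Equipment", "Personal Accessories"},
    {"Outdoor Protection", "Personal Accessories"},
    {"Mountaineering Equipment"},
]

def rule_metrics(transactions, lhs, rhs):
    """Support and confidence of the single-item rule lhs -> rhs."""
    n = len(transactions)
    lhs_count = sum(1 for t in transactions if lhs in t)
    both_count = sum(1 for t in transactions if lhs in t and rhs in t)
    support = both_count / n  # share of all baskets containing both sides
    confidence = both_count / lhs_count if lhs_count else 0.0
    return support, confidence

support, confidence = rule_metrics(baskets, "Outdoor Protection", "Camping Equipment")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In a chord chart like Figure 9, the first number would drive the chord width and the second its color.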

In addition to the cross-sell campaign, the retailer decides to provide an enriched experience to its customers through a chatbot interaction on the website. The goal is to address concerns that customers have about the outdoor protection category, such as allergies, and to recommend nature-friendly products that don’t contain DEET (the chemical that customers have complained about on Twitter). The data scientist builds a logistic regression (LR) classification model using a Scala notebook to predict the likely buyers of insect repellents. He uses demographic attributes such as gender, age, marital status and profession, along with negative tweet count, to predict the buying propensity.

He then plots the performance of the model using a receiver operating characteristic (ROC) curve, which shows the true positive rate versus the false positive rate, and sees a strong prediction with a steep curve on the holdout data (Figure 10). This represents a model that predicts insect repellent buyers with strong recall (it correctly identifies most of them) while keeping the false positive rate low (few customers who don’t buy insect repellents are incorrectly predicted as buyers).

Figure 10: ROC Curve of LR Classification model for Insect Repellents
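For readers who want to see how such a curve is built, the sketch below traces ROC points by sweeping a threshold over model scores and then computes the area under the curve with the trapezoidal rule. The labels and scores are invented for illustration and do not come from the Repellent model.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr) by sweeping a threshold over the scores.

    labels: 1 for buyers, 0 for non-buyers; scores: predicted probabilities.
    Assumes all scores are distinct (tied scores would need to be grouped).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical holdout labels and model scores, for illustration only.
holdout_labels = [1, 1, 0, 1, 0, 1, 0, 0]
holdout_scores = [0.95, 0.90, 0.80, 0.70, 0.40, 0.35, 0.20, 0.10]

curve = roc_points(holdout_labels, holdout_scores)
print("AUC:", auc(curve))
```

A steep curve corresponds to the true positive rate climbing quickly while the false positive rate is still small, which pushes the area under the curve toward 1.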

The data scientist deploys the Repellent model and provides a RESTful endpoint for the application developer to call from within a chatbot (Figure 11). When a customer signs on to the retailer’s web page, a chatbot pops up. The chatbot makes a live call to the scoring service, passing JSON data with the customer profile, tent ownership and negative tweet count, and finds that the customer is unlikely to buy a repellent (2% probability of buying) and has also tweeted negatively about the product line due to allergic reactions. As a result, the chatbot recommends a DEET-free, nature-friendly, Eucalyptus-extract-based repellent with a 50% discount offer to make the customer a happy camper (pun intended).

Figure 11: Chatbot interaction with a customer addressing DEET concerns and offering a discount
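The exchange between the chatbot and the scoring service might look roughly like the sketch below. The article does not give the service’s request or response schema, so every field name, the 0.5 threshold and the decision rule here are hypothetical, and the network call is replaced by a stand-in response string.

```python
import json

def build_scoring_payload(profile, owns_tent, negative_tweet_count):
    """Serialise the inputs the chatbot sends to the scoring service.

    Field names are hypothetical; the real service schema is not given
    in the article.
    """
    payload = dict(profile)
    payload["owns_tent"] = owns_tent
    payload["negative_tweet_count"] = negative_tweet_count
    return json.dumps(payload)

def choose_recommendation(buy_probability, negative_tweet_count, threshold=0.5):
    """Pick a chatbot response from the model score (illustrative rule only)."""
    if buy_probability < threshold and negative_tweet_count > 0:
        return "Offer a discount on a DEET-free repellent"
    if buy_probability < threshold:
        return "Offer a discount on insect repellent"
    return "Suggest insect repellent without a discount"

# Invented customer profile using the demographic attributes named in the article.
profile = {"gender": "M", "age": 35, "marital_status": "Married", "profession": "Sales"}
request_body = build_scoring_payload(profile, owns_tent=True, negative_tweet_count=2)

# Stand-in for the JSON the scoring endpoint might return (2% buying probability).
response_body = '{"probability": 0.02}'
score = json.loads(response_body)["probability"]
print(choose_recommendation(score, negative_tweet_count=2))
```

Keeping the decision rule separate from the scoring call means the offer logic can change without redeploying the model.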

In summary, a data scientist can make a huge difference to a business by identifying the problem at hand through exploratory analysis, building one or more machine learning models to help predict customer behavior, and then tying it all together by working with an application developer to integrate the model scoring into an operational business process.

The analyses mentioned in this blog were created using visual exploration in IBM Watson Analytics, a self-service, guided discovery tool, and data science notebooks in IBM Data Science Experience Desktop (currently in Beta). The chatbot was built using Cognitive services in IBM Watson Developer Cloud and the application was built with Ruby on Rails in IBM Bluemix.

Acknowledgements: Thanks to my colleagues Aleksandr Petrov (data scientist), David Thomason (application developer) and Sivakumar Anne (data engineer) for their key contributions on this use case.


By: Avijit Chatterjee
