In the depths of the last cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa in Lanzarote, Canary Islands, for a one-week hackathon where they collaboratively developed a recommendation system on top of Apache Spark. The contest consisted of using Bristol customer shopping-behaviour data to make personalised recommendations in a Kaggle-like competition, where each team's goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a purpose-built framework.
The talk will cover:
• How to rapidly prototype in Spark (via the native Scala API) on your laptop and magically scale to a production cluster without huge re-engineering effort.
• The benefits of doing type-safe ETLs representing data in hybrid, and possibly nested, structures like case classes.
• Enhanced collaboration and fair performance comparison by sharing ad-hoc APIs plugged into a common evaluation framework.
• The co-existence of machine learning models available in MLlib and domain-specific bespoke algorithms implemented from scratch.
• A showcase of different families of recommender models (business-to-business similarity, customer-to-customer similarity, matrix factorisation, random forest and ensembling techniques).
• How Scala (and functional programming) helped our cause.
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicles applications. His main expertise is on building production-oriented machine learning systems. Co-author of the Professional Manifesto for Data Science, he loves evangelising his passion for best practices and effective methodologies amongst the community. Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
1. The Barclays Data Science Hackathon: Building Retail Recommender Systems based on Customer Shopping Behaviour
Gianmario Spacagna
@gm_spacagna
Data Science Milan meetup, 13 July 2016
2. The Barclays Data Science Team
• Part of the Retail Business Banking division, based at the HQ (Canary Wharf, London)
• At the time (Dec 2015) the team had 6 members: a head plus a mix of engineering and machine-learning specialists
• Goal: building data-driven applications such as:
– Insights Engine for small businesses
– Complaints NLP analytics
– Mortgage predictive models
– Pricing optimisation
– Graph fraud detection
– and so on...
3. Lanzarote off-site
• 1 week (5-day contest, Monday to Friday)
• Building a recommender system of retail merchants for people living in Bristol, UK
• Forget about 9-5 working hours
• Stimulate creativity and team working
• Brainstorm new ideas and make them happen
• Have fun!
4. The technical challenges
• No infrastructure available, only laptops and a 1G WiFi shared Internet connection.
• Build, test, and refactor quickly; no time for long end-to-end evaluations.
• Work with common structures without constraining individual initiative and innovation.
• Design for deployment to production on a multi-tenant cluster.
8. Why Spark? (just to name a few…)
• Speed / performance, in-memory solution
• Elastic jobs: you can start small and scale up
• What works locally works distributed, almost!
• A single place for doing everything, from source to endpoint
• It cuts development time, being designed according to functional programming principles
• Reproducibility via a DAG of declarative transformations rather than procedural side-effect actions
9. Preparation work (ETL)
• Extract, transform and load data into representations matching the business domain rather than the raw database representation
• Aggregate in order to increase generality while preserving anonymised information for training the models
• Every business is uniquely represented by the combo (MerchantName, MerchantTown), plus optionally a postcode when available
• Join each transaction that happened in Bristol with the business and customer details
10. Anonymised Generalised Data
• Bottom-up k-anonymity:
– Map all of the categorical attributes of each customer (online-active flag, residential area type, gender, marital status, occupation) into a bucket
– Group similar customers and replace the single bucket with a group of buckets, counting the number of group members
– Continue recursively until each user is mapped into a bucket group with at least k members
• Masking:
– Replace user identifiers with uniquely generated IDs
11. K-anonymity example
Raw data:
timestamp  | customerId | occupation | gender | amount | business
2015-03-05 | 9218324    | Engineer   | male   | 58.42  | Waitrose
2015-03-06 | 324624     | Cook       | female | 118.90 | Waitrose
2015-03-06 | 324624     | Cook       | female | 5.99   | Abokado

Anonymised data (categorical bucket group: engineer-male, student-male, cook-female):
Day of week | customerId | amount    | business
Thursday    | 00003      | [50-60]   | Waitrose
Friday      | 00012      | [100-120] | Waitrose
Friday      | 00012      | [0-10]    | Abokado
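The bucketing and masking described above can be sketched in plain Scala. This is a simplified greedy variant (repeatedly merge the two smallest bucket groups), not the team's actual implementation, and all names are illustrative:

```scala
// Simplified bottom-up k-anonymity sketch (illustrative, not the original
// implementation): start with one group per categorical bucket, then merge
// the two smallest groups until every group holds at least k customers.
def anonymise(customerBucket: Map[String, String], k: Int): Seq[(Set[String], Seq[String])] = {
  var groups: Seq[(Set[String], Seq[String])] =
    customerBucket.groupBy(_._2).map { case (bucket, members) =>
      (Set(bucket), members.keys.toSeq)
    }.toSeq
  while (groups.size > 1 && groups.exists(_._2.size < k)) {
    val sorted = groups.sortBy(_._2.size)
    val merged = (sorted(0)._1 ++ sorted(1)._1, sorted(0)._2 ++ sorted(1)._2)
    groups = merged +: sorted.drop(2)
  }
  groups
}

// Masking: replace real customer identifiers with uniquely generated IDs.
def mask(ids: Seq[String]): Map[String, String] =
  ids.zipWithIndex.map { case (id, i) => id -> f"$i%05d" }.toMap
```

A real implementation would merge only *similar* buckets; the merging policy is the interesting design choice here.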
13. Some numbers (Bristol only)
• ~70 GB of data (Kryo serialised format)
• A few million transactions from 2015 (1 year's worth of data)
• ~100K Barclays retail customers
• ~50K businesses
14. Recommender APIs
• RecommenderTrainer receives the raw data, performs the feature engineering tailored to the specific implementation, and returns a Recommender model instance.
• The Recommender instance takes an RDD of customer ids and a positive number N and returns the top N recommendations for each customer.
• We used the pair (MerchantName, MerchantTown) to represent the unique business we want to recommend.
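A minimal sketch of these shared interfaces, with Seq standing in for the Spark RDDs used in the real framework; the trait names follow the slide, but the exact signatures are assumptions:

```scala
// Sketch of the contest's common interfaces (simplified: Seq instead of RDD).
type Business = (String, String) // (MerchantName, MerchantTown)

trait Recommender {
  // Top-N ranked businesses for each requested customer id.
  def recommend(customerIds: Seq[Long], n: Int): Map[Long, Seq[Business]]
}

trait RecommenderTrainer {
  // Feature engineering is tailored to each implementation.
  def train(transactions: Seq[(Long, Business)]): Recommender
}

// A trivial baseline: recommend the most popular businesses to everyone.
object PopularityTrainer extends RecommenderTrainer {
  def train(transactions: Seq[(Long, Business)]): Recommender = {
    val ranked = transactions.groupBy(_._2).toSeq
      .sortBy { case (_, ts) => -ts.size }.map(_._1)
    new Recommender {
      def recommend(customerIds: Seq[Long], n: Int): Map[Long, Seq[Business]] =
        customerIds.map(id => id -> ranked.take(n)).toMap
    }
  }
}
```

Fixing this pair of traits is what let every team plug a different model into the same evaluation harness.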
17. Mean Average Precision (MAP)
• Each customer has visited m relevant businesses
• Recommendations predict n ranked businesses
• For a given customer we compute the average precision as:

ap@n = ( Σ_{k=1}^{n} P(k) ) / min(m, n)

• P(k) = precision at cut-off k in the recommendation list, i.e. the ratio of relevant businesses up to position k; P(k) = 0 when the k-th business is not relevant.
• MAP for N customers at n is the mean of the per-customer average precisions:

MAP@n = ( Σ_{i=1}^{N} ap@n_i ) / N
18. MAP example
Recommendations for Bob, N = 6 (the 3 businesses visited by test user Bob appear at ranks 1, 3 and 6):
Precision(k): 1/1, 0, 2/3, 0, 0, 3/6
Average Precision #Bob = (1 + 2/3 + 3/6) / 3 = 0.722

Recommendations for Alice, N = 6 (the 2 businesses visited by test user Alice appear at ranks 2 and 5):
Precision(k): 0, 1/2, 0, 0, 2/5, 0
Average Precision #Alice = (1/2 + 2/5) / 2 = 0.45

MAP@6 = (0.722 + 0.45) / 2 = 0.586
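The worked numbers above can be reproduced with a direct implementation of ap@n and MAP@n (business labels in the usage are hypothetical):

```scala
// Direct implementation of ap@n and MAP@n as defined on the slide.
// recs: ranked predictions; relevant: businesses the customer actually visited.
def averagePrecision(recs: Seq[String], relevant: Set[String], n: Int): Double = {
  var hits = 0
  var sum = 0.0
  for ((r, k) <- recs.take(n).zipWithIndex if relevant(r)) {
    hits += 1                       // relevant businesses seen up to rank k+1
    sum += hits.toDouble / (k + 1)  // P(k+1); non-relevant ranks contribute 0
  }
  sum / math.min(relevant.size, n)
}

def meanAveragePrecision(customers: Seq[(Seq[String], Set[String])], n: Int): Double =
  customers.map { case (recs, rel) => averagePrecision(recs, rel, n) }.sum / customers.size
```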
20. CUSTOMER-TO-CUSTOMER SIMILARITY MODELS
Each customer is represented in a sparse feature space.
We must define a metric space that satisfies the triangle inequality.
Similarity (or distance) is based on:
• Common behaviour (geographical and temporal shopping journeys)
• Common demographic attributes (age, residential area, gender, job position…)
21. Customer Features
• Represent each customer in terms of histograms:
– Distribution of spending across different dimensions: week days, postcode sectors, merchant categories, businesses
– Probability distributions of their generalised attributes: online activity, gender, marital status, occupation
• If we flatten each map and fill all of the missing keys with 0s, we can then compute the cosine distance between two customers
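The flatten-and-fill step can be sketched directly on the histogram maps, assuming each customer is a sparse Map from key to weight:

```scala
// Flatten two sparse histograms over the union of their keys (missing
// keys count as 0) and compute the cosine distance between them.
def cosineDistance(a: Map[String, Double], b: Map[String, Double]): Double = {
  val keys = a.keySet ++ b.keySet
  val dot = keys.toSeq.map(k => a.getOrElse(k, 0.0) * b.getOrElse(k, 0.0)).sum
  val normA = math.sqrt(a.values.map(v => v * v).sum)
  val normB = math.sqrt(b.values.map(v => v * v).sum)
  1.0 - dot / (normA * normB)
}
```

Identical customers get distance 0; customers with no overlapping keys get distance 1.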
22. Extracting Customer Features 1/2
• Businesses are too many to fit into a Map: we only take the top ones and assume the tail to be negligible.
• Wallet histogram: count each (customer, bin) pair using reduceByKey, followed by a groupBy on customer to merge all of the bin counts into a map.
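A local sketch of the aggregation just described, with Scala collection ops standing in for the Spark `reduceByKey` / `groupBy` calls (the code on the original slide is not reproduced here):

```scala
// Wallet-histogram aggregation: count each (customer, bin) pair
// (reduceByKey in Spark), then group by customer to merge the bin
// counts into one map per customer (groupBy in Spark).
def walletHistograms(txns: Seq[(Long, String)]): Map[Long, Map[String, Int]] =
  txns.groupBy(identity)
    .map { case ((cust, bin), ts) => ((cust, bin), ts.size) } // reduceByKey analogue
    .groupBy { case ((cust, _), _) => cust }                  // groupBy on customer
    .map { case (cust, binCounts) =>
      cust -> binCounts.map { case ((_, bin), n) => bin -> n }
    }
```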
23. Extracting Customer Features 2/2
• Broadcast variables should be destroyed at the end of their scope.
• 1. Select the distinct customer id with the associated categorical group.
• 2. Perform a map-side multi-join: one map over the whole RDD with multiple look-ups into broadcast maps.
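The map-side multi-join can be sketched locally, with plain Maps standing in for the broadcast variables; the record fields here are illustrative, not the team's actual schema:

```scala
// Map-side multi-join sketch: instead of several shuffling joins, do one
// pass over the transactions with look-ups into small side maps (these
// would be broadcast variables on a cluster).
case class Txn(customerId: Long, business: String, amount: Double)
case class Enriched(customerId: Long, business: String, amount: Double,
                    categoricalGroup: String, businessTown: String)

def multiJoin(txns: Seq[Txn],
              groupOf: Map[Long, String],    // broadcast map: customer -> group
              townOf: Map[String, String])   // broadcast map: business -> town
    : Seq[Enriched] =
  txns.flatMap { t =>
    for {
      g    <- groupOf.get(t.customerId)
      town <- townOf.get(t.business)
    } yield Enriched(t.customerId, t.business, t.amount, g, town)
  }
```

Records missing from either side map are silently dropped, mimicking an inner join.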
25. Vantage-point (VP) Tree
• It's a heuristic data structure for fast spatial search
• Each node of the tree contains one data point + a radius
– The left child branch contains points closer than the radius, the right branch those farther away
• Construction time: O(n log(n))
• Search time*: O(log(n))
*Under certain circumstances
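A minimal VP-tree can be sketched as follows; this is an illustrative toy over Euclidean vectors, not the implementation used at the hackathon:

```scala
// Minimal vantage-point tree sketch: each node keeps a vantage point and a
// radius; points closer than the radius go into the inside (left) branch.
case class VPNode(point: Vector[Double], radius: Double,
                  inside: Option[VPNode], outside: Option[VPNode])

def dist(a: Vector[Double], b: Vector[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

def build(points: Seq[Vector[Double]]): Option[VPNode] = points match {
  case Seq() => None
  case vp +: rest =>
    val ds = rest.map(p => (p, dist(vp, p)))
    val radius = if (ds.isEmpty) 0.0 else ds.map(_._2).sorted.apply(ds.size / 2)
    val (in, out) = ds.partition(_._2 < radius)
    Some(VPNode(vp, radius, build(in.map(_._1)), build(out.map(_._1))))
}

// Nearest-neighbour search: descend into the likelier branch first and
// prune the other one when |d(q, vp) - radius| already exceeds the best
// distance found so far (this is where the triangle inequality is used).
def nearest(node: Option[VPNode], q: Vector[Double],
            best: (Vector[Double], Double)): (Vector[Double], Double) =
  node match {
    case None => best
    case Some(n) =>
      val d = dist(q, n.point)
      val b0 = if (d < best._2) (n.point, d) else best
      val (near, far) = if (d < n.radius) (n.inside, n.outside) else (n.outside, n.inside)
      val b1 = nearest(near, q, b0)
      if (math.abs(d - n.radius) < b1._2) nearest(far, q, b1) else b1
  }
```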
27. Common customers matrix
Each cell represents the distinct number of common customers between two businesses (B1-B4 label the businesses):

      B1   B2   B3   B4 | Sum
B1     -    3   10   12 |  25
B2     3    -    8    0 |  11
B3    10    8    -    1 |  19
B4    12    0    1    - |  13
Sum   25   11   19   13 |   -

Business similarities derived from the matrix:
• Conditional probability
• Tanimoto coefficient
29. NEIGHBOUR-TO-BUSINESS
Hybrid approach of K-Neighbours combined with
Business-to-Business
3 levels: customer neighbours -> neighbours' businesses -> businesses' neighbours
We named this model the Botticelli model
32. MATRIX FACTORIZATION
MODELS
Factorize the transaction matrix of Customer-to-Business into 2 matrices of Customer-to-Topic and Topic-to-Business (e.g. LSA, SVD…)
Recommendations are done by applying linear algebra
34. ALS is available in Spark MLlib
• Ratings as counts of transactions
• Model parameters are the factorized matrices. We had to re-implement the scoring function due to scalability issues.
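The re-implemented scoring boils down to linear algebra on the factor vectors; a minimal sketch (function names and the ranking helper are illustrative, not the team's code):

```scala
// Score of a business for a customer: dot product of their latent-factor
// vectors from the two factorized matrices.
def score(userFactors: Array[Double], businessFactors: Array[Double]): Double =
  userFactors.zip(businessFactors).map { case (u, b) => u * b }.sum

// Rank all candidate businesses for one user and keep the top n.
def rank(user: Array[Double], businesses: Map[String, Array[Double]], n: Int): Seq[String] =
  businesses.toSeq.sortBy { case (_, f) => -score(user, f) }.take(n).map(_._1)
```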
36. Top N without sorting
The accumulator holds at most N elements
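One way to realise "top N without sorting" is a bounded min-heap acting as the accumulator; this is a plausible local sketch, not necessarily the exact approach on the slide:

```scala
import scala.collection.mutable

// Top-N without a full sort: stream through the scored items keeping a
// bounded min-heap of at most n elements (the "accumulator").
// One pass, O(size * log n), instead of sorting everything.
def topN[A](scored: Iterator[(A, Double)], n: Int): Seq[(A, Double)] = {
  // Invert the ordering so the *lowest* score sits at the head of the queue.
  val heap = mutable.PriorityQueue.empty[(A, Double)](Ordering.by[(A, Double), Double](p => -p._2))
  for (x <- scored) {
    if (heap.size < n) heap.enqueue(x)
    else if (x._2 > heap.head._2) { heap.dequeue(); heap.enqueue(x) }
  }
  heap.dequeueAll.reverse // ascending by score -> reverse for descending
}
```

In Spark the same bounded structure can be used inside an `aggregate` or `mapPartitions` so only N elements per partition ever cross the network.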
37. OTHER APPROACHES
Covariance Matrix:
build a covariance matrix of each pair of users and then multiply it with the user-to-business matrix
Random Forest:
one binary classifier for each business
Ensembling models:
aggregating recommendations from different models
40. Limitations
• ML and MLlib are not flexible enough and need some extra development (bloody private fields)
• Linear algebra libraries in MLlib are limited; it took us a while to learn how to optimize them
• Scala and Spark create confusion for some method behaviour (e.g. fold, collect, mapValues, groupBy)
• Many machine learning libraries are based on vectors and don't easily allow ad-hoc definition of data types based on the business context
41. Conclusions
• Spark and Scala were excellent tools for rapid prototyping during the week, especially for bespoke algorithms.
• We used the same production stack together with notebooks for ad-hoc explorations or quick-and-dirty tests.
• At the end of the hackathon the best model was almost a production-ready MVP.
43. Off-site
• The success of the hackathon was not solely down to technology.
• Innovation requires an environment where great people can:
– connect with each other
– set clear, ambitious goals
– work together free of distractions
– feel that the pressure to deliver comes from the group
– fail safely, go to sleep, wake up the next day (go surfing) and try again!
44. Original article on the Cloudera Engineering Blog:
https://blog.cloudera.com/blog/2016/05/the-barclays-data-science-hackathon-using-apache-spark-and-scala-for-rapid-prototyping/
GitHub code:
https://github.com/gm-spacagna/lanzarote-awesomeness
Further reading (many references on Agile and Spark), Data Science Vademecum:
http://datasciencevademecum.wordpress.com
The Professional Data Science Manifesto:
http://www.datasciencemanifesto.org/
The Barclays Data Science team at this hackathon was: Panos Malliakas, Victor Paraschiv, Harry Powell, Charis Sfyrakis, Gianmario Spacagna and Raffael Strassnig