Machine Learning and AI applied to Social Data

eCairn has been using Machine Learning/AI since 2017, leveraging years of analyzing, categorizing, and manipulating social data.

We apply AI/ML in two of our core processes: content curation and affluence prediction.

1- Content curation.

The objective of content curation is to read a large number of posts and tweets, and to identify and tag opportunities and insights.

We used to do this with a combination of topic engineering and trained human curators (which we still do for white-glove services). The curator would look at a pre-filtered river of news/tweets and tag the opportunities and insights, according to what our client is interested in.

It works great, but comes with some limitations:

  • It takes time, and the time grows quickly with the number of people listened to.
  • Sometimes the interesting part of a tweet is the “linked” article or picture, and topic filtering cannot catch these opportunities.
  • Curators, even the best ones, have bad days 😉

As the technology around AI/ML became more affordable, we automated the curation process. We train our models on the tens of thousands of hand-tagged examples from our years of manually listening to affluents.

We primarily use Amazon ML, Amazon SageMaker, spaCy, and Amazon Rekognition.

The algorithm (model) “reads” the tweet and assesses whether the tweet belongs to one of our ten predefined categories: Opinion, Question and Request, Event, etc.
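The idea can be sketched as a standard supervised text classifier. This is a minimal illustration, not eCairn's actual model: the category names come from the post, but the tiny hand-written training set and the TF-IDF + logistic regression pipeline are stand-ins for the real training data and architecture.

```python
# Illustrative sketch: classify a tweet into one of the predefined categories.
# Training examples below are invented for this example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

CATEGORIES = ["Opinion", "Question and Request", "Event"]

train_texts = [
    "I think this fund is overpriced, honestly",
    "In my opinion the market is due for a correction",
    "Can anyone recommend a good estate lawyer?",
    "Looking for advice on 529 plans, any tips?",
    "Join us Saturday for the charity gala, tickets on Eventbrite",
    "Our annual meetup is next Thursday at 6pm",
]
train_labels = ["Opinion", "Opinion", "Question and Request",
                "Question and Request", "Event", "Event"]

# TF-IDF features over unigrams and bigrams, then a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)

def categorize(tweet: str) -> str:
    """Return the most likely category for a tweet."""
    return model.predict([tweet])[0]
```

In production, a model like this would be trained on the tens of thousands of hand-tagged examples mentioned above rather than a handful of invented sentences.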

Here are a few tweets which were captured by the ML algorithm and categorized:

[Screenshots: example tweets categorized by the model]

Yes, our definition of “life events” includes pets ;-).

 

From our measurements, the machine is clearly superior to a human at this process, even if, from time to time, a “false positive” slips into the list of opportunities and insights.

1). The volume of opportunities detected by the machine is 4 times what human curators used to identify.

2). Some opportunities are easier for a machine to spot (sometimes with pre- or post-processing). Here are a few examples:

  • Life Events: We observed that many people share family pictures, and that picture analysis is critical to detect life events.
  • Local tweets: only a few tweets are geolocalized. Yet many tweets/posts/stories carry location information. If, for example, I tweet “I am going to CES”, “I’m attending a Warriors game tomorrow”, or “Great view from the Sisters”, it’s clear that I will be, respectively, in Vegas, San Francisco, or Bend. You can’t really find people who do this for any location and any context. Using named entity extraction along with the Google Maps API addresses this issue.
  • Authored content, i.e. people publicizing on Twitter articles they have authored on their blog or in a newspaper. This often requires analyzing URLs and cross-checking usernames. To check “me”, you would need to connect my Twitter id @dominiq with this blog, ecairn.com/blog.
  • Events: Mentions or links to event platforms (Eventbrite, Meetup…) are important signals to identify an event.
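The “local tweets” example above can be sketched as entity extraction followed by a lookup. The toy gazetteer below is a hand-picked stand-in for illustration only; as the post describes, the real pipeline uses named entity extraction (e.g. spaCy) plus the Google Maps API to resolve entities to places.

```python
# Illustrative sketch: infer a likely location from entities mentioned
# in a tweet. The gazetteer is a toy stand-in for NER + geocoding.
import re
from typing import Optional

# Toy entity -> city mapping (hand-picked for this example).
GAZETTEER = {
    "CES": "Las Vegas",
    "Warriors": "San Francisco",
    "Sisters": "Bend",
}

def infer_location(tweet: str) -> Optional[str]:
    """Return a city if the tweet mentions a known entity, else None."""
    for entity, city in GAZETTEER.items():
        if re.search(rf"\b{re.escape(entity)}\b", tweet):
            return city
    return None
```

In the real setup, the named-entity extractor finds candidate place/event mentions generically, and the geocoding API resolves them, so no hand-built table is needed.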

 

2- Affluence prediction

Many of our clients are targeting affluents (Wealth Management, Luxury …) and one of the core capabilities we provide is to identify lists of digital affluents in a specific geography, niche, or market segment.

Unlike specialized companies like WealthEngine, Wealth-X, or DonorSearch, which collect brick-and-mortar data about salaries, property values, stock purchases, donations…, we primarily use social data to predict whether or not someone is affluent.

Keep in mind that our goal is, starting with a huge list of, say, 20K followers of a brand on Twitter, to spot the few hundred who probably have a lot of money (~$1M investable). Collecting the real-life data for the 20K followers each time is very expensive and does not provide good coverage for “new/digital affluents”.

To solve this problem with AI, we use large training sets (~10K affluent and non-affluent people) that we manually tagged and partially validated using “brick-and-mortar” data, and we have used these training sets to make affluence predictions.

The challenges we faced building these algorithms were quite typical:

1-Define the problem.

After many tests/iterations, we realized that we were not able to build a generic algorithm and that we needed to build one algorithm per metropolitan area: San Francisco, New York, Los Angeles.

The reason for this is that affluents in Los Angeles and in San Francisco, for example, behave very differently on social media. So far we have not been able to “generalize” our models, so we operate and maintain one model per regional market.

Another challenge is that social data is really messy, and before predicting that someone is affluent, we need to do a lot of cleanup. We need to filter out accounts such as bots, brands, avatars, or even dogs!

2- Selecting the feature set.

The challenge is to select features that are informative and are readily available for a large chunk of the population that we profile.

Here is a partial list of features that we use:

  • bio
  • within the bio, we use both content and style, i.e. the quality of the wording, the use of emoticons, excessive use of punctuation…
  • job title, company, location
  • company bio and attributes
  • location bio and attributes (is the Zipcode an affluent zip code?)
  • social graph (who these people follow & who follows them back)
  • and some other features that belong to our secret sauce.
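The “style” features above can be made concrete with a small extractor over the bio text. This is a sketch under assumptions: the feature names, the emoticon pattern, and the punctuation rule are invented for illustration, not eCairn's actual feature set.

```python
# Illustrative sketch: turn a Twitter bio into simple style features
# (wording, emoticons, punctuation). All feature definitions here are
# invented for this example.
import re

def bio_style_features(bio: str) -> dict:
    """Extract simple style features from a profile bio."""
    words = bio.split()
    return {
        "length": len(bio),
        "word_count": len(words),
        # Matches simple emoticons like :) ;-) :D :P
        "emoticon_count": len(re.findall(r"[:;]-?[)(DP]", bio)),
        "exclamation_count": bio.count("!"),
        # Two or more consecutive ! or ? counts as "excessive".
        "excessive_punctuation": bool(re.search(r"[!?]{2,}", bio)),
        "all_caps_words": sum(1 for w in words if w.isupper() and len(w) > 1),
    }
```

Feature dictionaries like this would then be combined with the other signals (job title, company, location, social graph) and fed to the affluence classifier.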

3- Building a solid training set.

In our experience, we need more than 5000 positive and 5000 negative examples to build a good model. That’s a lot of people to tag manually! We improve/validate our training sets using (expensive) brick and mortar data points.

Our goal is to be “as good at guessing that someone is affluent” as a banker researching someone’s profile, but much faster: thousands per minute. Also, we provide a guess, not truth: we don’t know if “a partner in a law firm” is spending all his/her money in Vegas …. unless they follow many casinos ;-). And still, we don’t know if they WIN.

Here is the performance that we get. The graph is for the San Francisco Bay Area market:

[Graph: affluence model performance for the San Francisco Bay Area]

As our objective is “filtering”, our main concern is false positives, and we get to the 90% mark by raising the classification threshold.
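The threshold trade-off works as follows: raising the score cutoff keeps only the highest-confidence predictions, which lifts precision (fewer false positives) at the cost of recall. Here is a minimal sketch; the scores and labels are made-up illustration data, not eCairn's results.

```python
# Illustrative sketch: precision among examples scored at or above a
# decision threshold. Scores and labels below are invented data.
def precision_at_threshold(scores, labels, threshold):
    """Precision over the examples the model selects at this threshold."""
    selected = [y for s, y in zip(scores, labels) if s >= threshold]
    if not selected:
        return None  # nothing selected at this threshold
    return sum(selected) / len(selected)

# Made-up model scores and true labels (1 = affluent, 0 = not).
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1,    1,   1,    0,   1,   0,   0,    0]

low = precision_at_threshold(scores, labels, 0.5)    # looser filter
high = precision_at_threshold(scores, labels, 0.82)  # stricter filter
```

With these toy numbers, the loose threshold selects 7 profiles of which 4 are truly affluent (precision 4/7), while the strict threshold selects 3 of 3 (precision 1.0), mirroring the filtering objective described above.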

 

Want to learn more about our technology/AI practice? Just go ahead and book a 30-minute call: https://meetme.so/DominiqueLahaix

 

@dominiq

