I recently happened across this tweet from Mike Kearney about his new R package called botornot. Its core function is to classify Twitter profiles into two categories: “bot” or “not”.
Having seen the tweet, I couldn’t not take the package for a spin. In this post we’ll try to determine which of the Buffer team’s Twitter accounts are most bot-like. We’ll also test the botornot model on accounts that we know to be bots.
Data Collection
The botornot function requires a list of Twitter account handles. To gather the Buffer team’s accounts, we can collect recent tweets from the Buffer team Twitter list using the rtweet package and extract the screen_name field from the collected tweets. First we need to load the libraries we’ll need for the analysis.
This query only returns tweet data from the past 6-9 days.
We can gather and list the account names from the tweets dataframe.
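As a minimal sketch of this step, assuming tweets is a data frame with one row per tweet and a screen_name column, the handles can be pulled out with base R. The rows below are made-up placeholders, not the collected data.

```r
# Hypothetical stand-in for the collected tweets; the real data frame
# returned by rtweet has many more columns.
tweets <- data.frame(
  screen_name = c("alice", "bob", "alice", "carol", "bob"),
  text = c("hello", "a link", "another tweet", "hi all", "a retweet"),
  stringsAsFactors = FALSE
)

# Keep one entry per account, in order of first appearance.
users <- unique(tweets$screen_name)
print(users)
```

Because the tweets come from a list timeline, an account only shows up here if it tweeted within the window the query covers.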
Great, most of the team is present in this list. Interestingly, accounts like @bufferdevs and @bufferlove are also included. It will be interesting to see if they are assigned high probabilities of being bots!
The Anti Turing Test
Let’s see if the humans can convince an algorithm that they are not bots. Before we begin, it may be useful to explain how the model actually works.
According to the package’s README, the default gradient boosted model uses both users-level (bio, location, number of followers and friends, etc.) and tweets-level (number of hashtags, mentions, capital letters, etc. in a user’s most recent 100 tweets) data to estimate the probability that users are bots.
Looking at the package’s code, we can see that the model’s features also include the number of tweets sent from different clients (iphone, web, android, IFTTT, etc.), whether the profile is verified, the tweets-to-follower ratio, the number of years that the account has been on Twitter, and a few other interesting characteristics.
I’ll obfuscate the Twitter handles for privacy’s sake, but they can easily be found by reproducing the steps in this analysis or by using an MD5 reverse lookup.
The following code calculates the bot-probabilities for the Buffer team’s accounts and sorts them from most to least bot-like.
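The sorting part of that step can be sketched with base R alone. The column names user and prob_bot are assumptions about the shape of the model output, and the accounts and probabilities below are invented for illustration.

```r
# Hypothetical model output: one row per account with its estimated
# probability of being a bot. Values are made up.
data <- data.frame(
  user = c("acct_a", "acct_b", "acct_c"),
  prob_bot = c(0.42, 0.97, 0.68),
  stringsAsFactors = FALSE
)

# Sort from most to least bot-like and reset the row names.
ranked <- data[order(data$prob_bot, decreasing = TRUE), ]
rownames(ranked) <- NULL
print(ranked)
```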
The model assigns surprisingly high probabilities to many of us. The account @bufferlove is assigned a 99.9% probability of being a bot, and the @bufferdevs and @bufferreply accounts are also given probabilities of 90% or higher. Verified accounts and accounts with many followers seem less likely to be bots.
Working for a company like Buffer, I can understand why this model might assign a higher-than-average probability of being a bot. We tend to share many articles, use hashtags, and retweet a lot. I suspect that scheduling link posts with Buffer greatly increases the probability of being classified as a bot by this model. Even so, these probabilities seem to be a bit too high for accounts that I know not to be bots.
Let’s gather more data and investigate further. We have tweet-level data in the tweets dataframe, so let’s gather user-level data now. We’ll do this with the search_users function: we’ll search for users with “@buffer” in their bio and save the results in the users dataframe.
Once we have the user list, we can join users to the data dataframe on the screen_name field.
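A join like this can be sketched with base R’s merge. Both data frames below are hypothetical stand-ins, and the sketch assumes both carry a screen_name column; if the model output names its handle column differently, the by.x and by.y arguments of merge would be needed instead.

```r
# Hypothetical bot probabilities, keyed by handle.
data <- data.frame(
  screen_name = c("alice", "bob", "carol"),
  prob_bot = c(0.42, 0.97, 0.68),
  stringsAsFactors = FALSE
)

# Hypothetical user-level data from the bio search.
users <- data.frame(
  screen_name = c("bob", "alice", "dave"),
  followers_count = c(150, 12000, 300),
  stringsAsFactors = FALSE
)

# Inner join: only handles present in both data frames are kept.
joined <- merge(data, users, by = "screen_name")
print(joined)
```

Note that an inner join silently drops accounts that the bio search did not return, so it is worth comparing row counts before and after.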
Let’s see how the probability of being a bot correlates with the number of followers that people have. We’ll leave our CEO, Joel (@joelgascoigne), out of this since he is such an outlier. He’s too famous!
We can see that there is a negative correlation between follower count and bot probability. This makes sense – bots seem less likely to have lots of followers.
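One way to put a number on that relationship is a rank correlation after dropping the outlier. This is only a sketch on invented data: the handles, follower counts, and probabilities are placeholders, and Spearman’s method is my choice here because follower counts are heavily skewed, not something taken from the original analysis.

```r
# Hypothetical joined data; "ceo" stands in for the outlier account.
joined <- data.frame(
  screen_name = c("a", "b", "c", "d", "ceo"),
  followers_count = c(200, 1500, 4000, 9000, 150000),
  prob_bot = c(0.95, 0.80, 0.60, 0.35, 0.10),
  stringsAsFactors = FALSE
)

# Exclude the outlier before measuring the relationship.
team <- joined[joined$screen_name != "ceo", ]

# Spearman rank correlation; a negative value means accounts with more
# followers tend to receive lower bot probabilities.
rho <- cor(team$followers_count, team$prob_bot, method = "spearman")
print(rho)
```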
Now let’s look at the relationship between bot-probability and the percentage of Tweets sent with Buffer. First we’ll calculate the proportion of tweets that were sent with Buffer for each user.
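A minimal sketch of that calculation, assuming the tweets data frame has a source column holding the name of the posting client and that Buffered tweets are labelled "Buffer" there. The rows are invented for illustration.

```r
# Hypothetical tweets with the client each one was sent from.
tweets <- data.frame(
  screen_name = c("alice", "alice", "alice", "alice", "bob", "bob"),
  source = c("Buffer", "Buffer", "Twitter Web Client", "Buffer",
             "Twitter for iPhone", "Buffer"),
  stringsAsFactors = FALSE
)

# Flag tweets sent with Buffer, then average the flag per user:
# the mean of a logical vector is the proportion of TRUE values.
tweets$buffered <- tweets$source == "Buffer"
by_user <- aggregate(buffered ~ screen_name, data = tweets, FUN = mean)
names(by_user)[2] <- "prop_buffered"
print(by_user)
```

The resulting per-user proportions can then be joined to the bot probabilities on screen_name.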
The plot below shows the relationship between the probability of being a bot and the percentage of tweets Buffered.
We can see that there is a positive correlation between the proportion of tweets Buffered and the probability of being a bot. This is interesting, but not totally unexpected.
Definitely Bots
Now it’s time to see how the model does with accounts we know to be bots. I gathered some names from this site, which maintains a few lists of Twitter bots.
Surprise! They all have been assigned very high probabilities of being bots, because they are bots. The “tiny_raindrops_” account is 100% a bot.
Conclusions
We had a fun time playing with this package – thanks for following along! I could imagine something like this being used as a weighted input in a spam prediction model in the future. However, the botornot model is imperfect as-is. We’ll continue to have some fun with it and will have to consider making some tweaks before putting it into production.
Thanks for reading! Let me know if you have any thoughts or questions in the comments below!
post credits: bufferapp