Using a mathematical model, GlobalData has found that around 10 per cent of Twitter’s active accounts are posting spam content.
The leading data and analytics company noted that this is double Twitter’s reported figure. The gap is likely due to a difference in the criteria for what counts as ‘spam’.
Sidharth Kumar, Senior Data Scientist at GlobalData, said:
“What is or is not spam is suddenly an important discussion point for the social media platform, given that Elon Musk’s bid to take over Twitter is now on hold due to a disagreement on the proportion of spam accounts on the platform. Twitter claims that bot/spam accounts on Twitter represent less than 5 per cent of accounts while Elon Musk’s team thinks otherwise.
“The precise proportion of spam accounts is difficult to compute, as it is almost impossible to confirm the identity of the entity behind a tweet handle. Additionally, the definition of spam account may differ for everyone. Incessant tweeting of non-original content can be considered spam, but some may choose to see it as a very active user sharing articles/opinions.”
Parameters for Model
Taking all these into consideration, GlobalData’s mathematical model estimated the number of spam accounts using multiple parameters to provide a weighted score, which was then used to determine the classification of ‘spam’ or ‘non-spam’.
These parameters were chosen by focusing on the differences in activity between typical spam accounts and an average Twitter user, GlobalData noted. Accounts performing poorly on many parameters received a higher score, indicating a higher probability of being spam.
GlobalData analysts then independently observed handles at different score levels, and decided the cutoff for the classification (‘spam’ or ‘non-spam’) by consensus.
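The scoring step described above can be illustrated with a minimal Python sketch. GlobalData has not published its weights or cutoff, so the parameter names, the values in WEIGHTS and the CUTOFF below are hypothetical and serve only to show how a weighted score can be turned into a ‘spam’ or ‘non-spam’ label.

```python
from typing import Dict

# Hypothetical weights: the source does not publish the real values.
WEIGHTS: Dict[str, float] = {
    "unverified": 1.0,
    "high_retweet_share": 2.0,
    "short_description": 0.5,
}

# Hypothetical cutoff: the analysts chose theirs by consensus.
CUTOFF = 2.0


def spam_score(flags: Dict[str, bool]) -> float:
    """Add up the weights of every parameter the account performs poorly on."""
    return sum(weight for name, weight in WEIGHTS.items() if flags.get(name, False))


def classify(flags: Dict[str, bool]) -> str:
    """Label an account 'spam' when its score reaches the cutoff."""
    return "spam" if spam_score(flags) >= CUTOFF else "non-spam"
```

In this sketch, an account flagged on both "unverified" and "high_retweet_share" scores 3.0 and is labelled ‘spam’, while an account flagged only on "short_description" scores 0.5 and is labelled ‘non-spam’.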
The model used 10 parameters:
a) Is the tweet handle verified?
b) Is the tweet coming from third-party avenues?
c) What is the number of historic tweets that the handle has produced, divided by the days since its creation?
d) How frequent were the last 200 tweets?
e) What is the proportion of retweets in the last 200 tweets?
f) Of the last 200 tweets, how many did not contain any hashtags or links?
g) What is the standard deviation in typical tweet length?
h) What is the median time between two tweets?
i) What is the length of the description in the profile?
j) Of the last 200 tweets, what is the proportion of links shared?
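Several of these parameters are simple statistics over an account’s recent tweets. The Python sketch below shows one way four of them could be computed. It is an illustration only: the Tweet record and the function names are hypothetical, and neither Twitter’s API schema nor GlobalData’s actual implementation is described in the source.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median, pstdev
from typing import List


@dataclass
class Tweet:
    """Hypothetical tweet record; not Twitter's API schema."""
    text: str
    created_at: datetime
    is_retweet: bool


def tweets_per_day(total_tweets: int, created: datetime, now: datetime) -> float:
    """Historic tweet count divided by days since account creation."""
    days = max((now - created).days, 1)  # avoid dividing by zero for new accounts
    return total_tweets / days


def retweet_share(tweets: List[Tweet]) -> float:
    """Proportion of retweets among the supplied tweets."""
    return sum(t.is_retweet for t in tweets) / len(tweets)


def length_stdev(tweets: List[Tweet]) -> float:
    """Population standard deviation of tweet length in characters."""
    return pstdev([len(t.text) for t in tweets])


def median_gap_seconds(tweets: List[Tweet]) -> float:
    """Median time in seconds between consecutive tweets (needs at least two)."""
    times = sorted(t.created_at for t in tweets)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return median(gaps)
```

In practice these functions would be applied to an account’s last 200 tweets, and each result compared with what is typical for an average user before feeding into the weighted score.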
“There were a few research pieces published earlier in the media looking at the followers of certain handles to estimate spam or bot proportions. We felt that the correct approach would be to analyze samples of live streams, as that is more indicative of Twitter activity.
“Our estimate is conservative, as we wanted to be sure that we were correctly identifying accounts as spam. It is important to note that this is still an estimation. There is no conclusive way of knowing if a certain account is a bot or spam.”
Sidharth Kumar