It’s been long since I last wrote blog entries in English. I still feel that I express better in Chinese. After all, that’s my mother tongue! I assume some of my readers are bored of Chinese-only posts (let me assume!), so I’m writing in English again.
Disclaimer I am in no way trying to reverse engineering how Sina Weibo implemented their features, nor am I saying that I know how the features are implemented. The entry is just my thoughts on how the features could be implemented.
Recently, Sina Weibo (if you don’t know it, replace the product with ‘Imaginary Twitter’) gradually pushed to its users a feature called ‘我经常访问的人(仅自己可见)’, which has not been officially translated as of the writing of this entry. It, unofficially and according to me, translates to ‘frequented visitees (only visible to yourself)’. The feature name should be quite self-explanatory.
The feature reminds me of two interesting algorithms that I have seen in the past.
A problem from The Beauty of Programming
The Beauty of Programming is a book by a group of people at MSRA, telling the stories of interviews of former interns. One of the algorithm puzzles is:
You are given the list of post authors in a forum and you know that there is someone (King of Water) who has created more than a half of the posts. How do you find the one?
The designated solution is ‘perfect’ in the sense that it is one-pass, linear in time and takes only constant additional space. It will be bothering to describe the solution now because the algorithm is a simple variant of the one mentioned in the next section.
Stream counting problem
This algorithm is introduced to me during my exchange program at Aarhus Universitet by Kasper Green Larsen. Imagine you are given a sequence of data. The algorithm processes data in a streaming fashion and approximates the proportion of objects in the stream. It goes as the following:
- Initialise empty bins , each capable of containing an object (denoted by itself) and a counter (denoted by , initially zero), and a total counter (, the number of objects till now).
- When given an object , increase by one and:
- If for some , increase by one.
- Otherwise, if for some , let and set to one.
- Otherwise, decrease all the ’s by one.
- When asked to report proportion of :
- If for some , report the proportion as .
- Otherwise, report the proportion as zero.
Obviously the reported value is no greater than the real value. By amortised argument, the reported value is no less than the real value minus : the reported count deviates one from the real value only when branch 3 in the processing step is taken; each time branch 3 is taken, the sum of the counters decrease by ; the total increase cannot exceed and the counters never get negative, therefore nor can the total decrease exceed ; from these we conclude that branch 3 is taken for no more than times, thus the bound.
To solve the problem in the previous section, use one bin. In that case, the ‘King of Water’ will always be the one kept in the bin after all the data are processed.
Application to Sina Weibo’s scenario
The user’s visiting data come in stream, which is why we naturally come up with the ideas of streaming data processing. When dealing with real-world users, we must also consider the meaningfulness of results. When we say ‘frequented visitees’, not only should they be the accounts one visits the most often, but they must also be significantly more often than the others. Suppose a user has visited 5 accounts twice and 100 other accounts once. It is not appropriate in a daily sense to report the 5 accounts as frequented.
A simple model is to report, say, 10 top accounts, on each of which the user spends at least 9% of total time spent on the homepage of others’ account. Weibo can estimate the frequency with the aforementioned algorithm with 100 bins and find the top ones.
N.B. The algorithm does not guarantee to preserve orders! However, for real-world application, I believe the algorithm is a good enough approximation. The threshold I chose is interesting in the sense that there can be at most 12 persons being the candidate (hint: the algorithm shold consider all people with reported frequency at least 8%). It is the case for me that the tail few accounts aren’t really considered ‘frequented’ by me, and the top accounts are really the ones I visit often.
To further refine the model, we have to take time into consideration. Suppose the user is inactive for a long time, the ‘frequented visitees’ might not be meaningful because of the discontinuity of his browsing. If we set the weight of visiting to be decreasing exponentially in time, we just need a simple tweak in the algorithm to take care of decaying and weighted counting.