This recommended pages feature on Facebook is kind of weird. When I get recommendations on Amazon or Netflix, they are specifically recommending products. When I get them on Facebook, they are basically recommending ways for me to live my life.
“We heard you like swimming!! Have you tried Walking??”
This is a great example of programming without syntax. Abstract syntax trees define the behavior of the code, and the task of the programmer is simply data entry.
I’m definitely still working on a reddit recommendation algorithm, but progress has slowed because of work constraints. I’ll post more eventually.
I was having trouble getting my iPhone to sync; it kept failing. I *might* have figured it out.
I was getting errors like “A folder was specified instead of a file” or “the required file cannot be found”.
First, I tried clearing out my photo cache in my My Pictures directory. That didn’t fix it. Next, I tried changing the USB cord and the USB port. That didn’t fix it either. Finally, I realized I had a P55 board and hadn’t updated the BIOS, and some forumgoers stated that updating your P55 BIOS would fix the problem. In addition to updating my Gigabyte UD4P BIOS, I updated my Intel chipset firmware (and, why not, my SSD firmware).
Afterwards, it seems to work much better. Ugh, what a pain.
Still working on producing some Reddit Recommendation results.
I think some of my parameters for normalization are off: features are overtraining into negative values. I should have time to look at it tonight. Although I’m able to reduce RMSE, the model doesn’t seem to produce the reasonable predictive results I expect to be achievable.
I used this script to enable Windows 7-esque snap functionality in Ubuntu (well, in any OS with the Compiz window manager).
Ok! Tonight I have a working reddit recommender implementation that produces output.
I’m going to test it in the coming days. Here is what I did:
- First, I transform the data from user and article names to integer ids, ordered by the order of the original file.
- Next, I split the data N ways for cross validation
- Next, I load the dataset into a dictionary structure (Google sparse hash map) in about 80 seconds. This allows fast access by user (user set size: 31055, item set size: 1913683).
- Now, while it runs, it outputs user features and article features
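As a rough illustration of the first three steps (this is not the actual C# / C++ code, just a minimal Python sketch), the id mapping, the N-way split, and the per-user lookup could look like the following. The function names, the `(username, article, vote)` tuple layout, and the shuffled round-robin split are all assumptions made for the example; a plain `dict` stands in for the Google sparse hash map.

```python
import random

def build_id_map(names):
    """Assign integer ids in order of first appearance in the file."""
    ids = {}
    for name in names:
        if name not in ids:
            ids[name] = len(ids)
    return ids

def encode_votes(votes):
    """votes: list of (username, article, vote) tuples, in file order (assumed layout)."""
    user_ids = build_id_map(u for u, _, _ in votes)
    item_ids = build_id_map(a for _, a, _ in votes)
    encoded = [(user_ids[u], item_ids[a], v) for u, a, v in votes]
    return encoded, user_ids, item_ids

def split_folds(rows, n_folds, seed=0):
    """Shuffle the rows, then deal them round-robin into n_folds folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def index_by_user(rows):
    """Group (item, vote) pairs under each user id for fast per-user access."""
    index = {}
    for user, item, vote in rows:
        index.setdefault(user, []).append((item, vote))
    return index
```

For cross validation, each fold would take a turn as the test set while the remaining folds are concatenated and passed to `index_by_user` for training.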
What’s left:
- Write a separate test system which loads a feature file and a test file and outputs the RMSE of the trained features, as well as the “percentage right” when just rounding off predictions (not rounding off is useful for prioritization). I plan to post the results of this soon, Wednesday if everything works out.
- Convert the system to be asymmetric SVD, using only item(article) features, so that the system can be used online. This is not hard but slightly slows down training, yet new users can be very quickly identified without data warehousing.
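The two numbers the test system is meant to report can be sketched as below. This is only an illustration in Python, and it assumes votes are encoded as +1 (up) and -1 (down), so that "rounding off" a prediction means taking its sign; the function names are made up for the example.

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error between predicted and actual votes."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))

def fraction_right(predictions, actuals):
    """Share of predictions whose rounded value matches the actual vote.

    Assumes votes are +1 or -1, so rounding means taking the sign.
    """
    hits = 0
    for p, a in zip(predictions, actuals):
        rounded = 1 if p >= 0 else -1
        if rounded == a:
            hits += 1
    return hits / len(actuals)
```

The unrounded predictions are still worth keeping around, since they give an ordering of articles by confidence rather than just a yes or no.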
So far, I’ve managed to get my C++ learner to compile in MonoDevelop, and my C# cross validator also compiles. MonoDevelop has been an absolute pleasure to work with; I’ve had very little difficulty getting my build configuration organized.
Currently, my learner is broken because it parses integer-indexed files, whereas the reddit data is provided as string usernames. I’m trying to decide if I want to take in the usernames directly, or convert them to integer ids before the recommender processes them. The former is simpler and more robust; the latter is more resilient to memory issues and is easier to write (the C# cross validator would do the transformation).
I wanted to run my algorithm in Ubuntu. Ubuntu is useful for the algorithm because I use a library designed by Google, and Ubuntu allows one to include this library with a simple apt-get. I had a problem where Ubuntu would not boot. Instead, it would go to a command prompt with the error “root.disk does not exist depending on shell”.
I believe what happened was that I had shut down improperly earlier. This occurred because I was trying to change my graphics resolution, my screen went black, and I had to hard-shutdown.
To fix this, I booted into Windows, went to Start -> cmd, and ran “chkdsk /f” on the drive where Ubuntu was installed (installed through Wubi; I did not want to partition).
Later, I booted to a live CD, then exited Ubuntu, and finally when I tried rebooting, I had no issues. I am not sure if booting into the live CD fixed my issue, or if running chkdsk did.
Anyways, I am now up and running with GCC and Code::Blocks!
Sigh, that was a huge pain.
Tonight I’m going to work on the reddit recommendation algorithm. Updates to come.
Getting C++ to work with Google sparse hash in Windows has been kind of a pain. In the interest of just getting this reddit recommender working, I’m going to reinstall Ubuntu on this machine. The C# cross validator works. The C++ recommender has been rewritten to be more generic and to take in the cross validator’s style of input files. I have some slight debugging to do, and a phase of testing.
As this is a busy week for me, I expect to spend some time Tuesday porting it over, and finally to spend Sunday testing it. I will try to make a post on reddit about my results on Wednesday of next week.
I’m done building my cross validator for solutions to the reddit voter data, built in C#. Now to update my Netflix recommender to take the reddit voter file. My previous solution was compiled with gcc. This time I am trying to use Visual Studio, so I have to figure out how to link Google sparse hash to Visual Studio properly. It shouldn’t be too hard, but I might not be able to do that until some time later this weekend.
Afterwards, I’ll write some analysis tools which use the output featuresets and allow one to see what users are similar to a given user.
Finally, I’ll do a write-up of the whole process, and a write-up on what benefits one would gain from having each user’s currently followed subreddits (a subreddit recommender).
Karmachode asks:
I’m curious, how could this data be used to recommend articles when each new article gets a brand new ID? This is unlike Netflix where recommending old movies is fine. In this case if you recommend old articles it isn’t of much use.
What I was trying to do today is create clusters for recommending people rather than for articles. I agree that the end goal should be recommending subreddits.
You’re sort-of right that recommending old reddits isn’t the goal in this process, but neither is clustering.
When performing machine learning, the first thing to ask yourself is what question you need to answer. What we’re trying to do is classify a list of frontpage articles: to provide for each of them a degree of confidence that the user will like it, and to minimize error (in the MSE sense). What you are proposing is a nearest neighbor solution to confidence determination. What I intend to do is iterative singular value decomposition, which discovers the latent features of the users. It’s a bit different, but it solves the problem better. For new articles, describe them by the latent features of the users who rate them, then decide which article’s latent features match the user most accurately.
Reddit just released public voter data, so I’m going to build a recommender this weekend.
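To make the idea concrete, here is a minimal Python sketch of one common way to do iterative SVD: stochastic gradient descent on user and article feature vectors, minimizing squared error. It is an illustration under stated assumptions, not the actual learner. The hyperparameters (`k`, `epochs`, `lr`, `reg`), the +1/-1 vote encoding, and the vote-weighted average used to describe a new article are all choices made for the example.

```python
import random

def train_factors(ratings, n_users, n_items, k=8, epochs=50,
                  lr=0.01, reg=0.02, seed=0):
    """Learn k latent features per user (P) and per article (Q) by SGD.

    ratings: list of (user_id, item_id, vote) with integer ids.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def new_article_features(votes, P):
    """Describe a new article by the users who voted on it.

    votes: list of (user_id, vote). Uses a vote-weighted average of the
    voters' latent features (an assumed scheme for this sketch).
    """
    k = len(P[0])
    feats = [0.0] * k
    for u, v in votes:
        for f in range(k):
            feats[f] += v * P[u][f]
    return [x / len(votes) for x in feats]

def score(user, article_feats, P):
    """Confidence that `user` will like an article with these features."""
    return sum(a * b for a, b in zip(P[user], article_feats))
```

With something like this, a brand new article needs no retraining: once a few known users have voted on it, `new_article_features` gives it a position in the latent space, and `score` ranks it for everyone else.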