Hi, let me walk you through the process of building a content-based recommender system. Watch your step, the work is still in progress. With the help of content-based approaches, you will be able to generate quite good recommendations. So it is something that you will be able to share with your friends and to use personally. It's up to you how much you want to share.

As I have already told you in the introductory video, content is not always a textual representation of items such as a movie plot, title, and description. Content in this context means a fixed-size feature space for an item. If an item is a book, then it can have attributes such as the book's author and publisher. If an item is a movie, then the list of attributes will likely include the movie's director, filming location, and budget.

To build a content-based recommender system, we need to answer three questions. How to generate items' representations. How to build users' representations in the same feature space. And how to measure the distance between a user and an item. Let us discuss these questions from top to bottom.

If you have a numerical or categorical item feature, then you don't need to do anything with it, except maybe one-hot encoding or a similar trick; use it as it is. If you have a textual item description or users' feedback, then you can take a look at the bag-of-words model. Do you remember TF-IDF from the previous lessons? Yes, you do. But there is a better way in case you have a lot of textual information. Multimedia data processing, including text data processing, is where deep learning embedding toolkits show their best advantage. Spark provides a Word2Vec algorithm implementation to generate fixed-size vectors for words in a text corpus. The Doc2Vec, or Paragraph2Vec, algorithm is a bit more complicated; nevertheless, the relevant issue was still open at the time of shooting this video. We can use the following simple approximation.
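Here is a minimal sketch of that approximation in plain NumPy. The three-dimensional word vectors and the IDF weights are made-up illustration values; in practice you would look words up in a trained Word2Vec model and compute IDF over your real corpus.

```python
import numpy as np

# Toy stand-in for a pre-trained Word2Vec lookup (hypothetical 3-d vectors;
# a real model would typically give 100-300-d vectors).
word_vecs = {
    "space": np.array([0.9, 0.1, 0.0]),
    "opera": np.array([0.2, 0.8, 0.1]),
    "epic":  np.array([0.3, 0.3, 0.6]),
}
# Hypothetical IDF weights, as if precomputed over the whole corpus.
idf = {"space": 1.5, "opera": 2.0, "epic": 0.7}

def item_vector(tokens):
    """IDF-weighted average of word vectors: the 'simple approximation'
    to a Doc2Vec-style document embedding."""
    known = [t for t in tokens if t in word_vecs]
    if not known:
        return np.zeros(3)
    weights = np.array([idf[t] for t in known])
    vecs = np.stack([word_vecs[t] for t in known])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

v = item_vector(["space", "opera", "unknownword"])  # unknown words are skipped
```

As a side note, Spark's Word2VecModel.transform already turns a document into the plain average of its word vectors, so the IDF weighting above is a refinement of what you get out of the box.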
You either train a Word2Vec model or use an already pre-trained one to generate fixed-size vectors for each word. And then you aggregate the word vectors for each item with, for example, IDF weights. As a side note, to outperform the existing pre-trained Word2Vec models you might have to collect a huge dataset and spend a lot of resources to optimize the model. For example, the Google News dataset consists of 100 billion words. Do you have a bigger collection?

Then the next question is the user's representation. In a standalone content-based model, a user is just an aggregate of the items he or she has rated. Aggregation can have different forms. You can simply average the feature vectors of all the rated items. You can use ratings as weights and additionally scale these ratings. Ratings can have timestamps, so you can decay item ratings the way I described to you in the non-personalized recommender systems video. In real life, you usually start with the simplest implementation and then tune your approach according to your service requirements.

The benefit of a content-based approach is that you can easily explain a user's vector, unless you use complex embeddings to generate the feature space. For example, if you have a coordinate representing the genre "action", then users with high values in this coordinate like action movies more than users with low values in this field. Vice versa, you can have users' representations in some feature space and build item representations by aggregating user feature vectors. You can also have a mixed user and item feature space and build so-called model-based recommender systems, but that goes beyond our discussion.

The last and most interesting question is how to measure the distance between a user and an item. This distance, or similarity, influences the rating prediction for item i by user u. The whole set of distance metrics is available for your experiments.
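The aggregation step above can be sketched as follows: a user vector built as a rating-weighted average of item vectors, with an exponential time decay on the ratings. The item vectors, ratings, and the 90-day half-life are made-up illustration values.

```python
import numpy as np

# Item vectors in some shared feature space (rows = items the user rated).
item_vecs = np.array([
    [1.0, 0.0, 0.5],   # item 0
    [0.0, 1.0, 0.5],   # item 1
    [0.8, 0.2, 0.0],   # item 2
])
ratings = np.array([5.0, 1.0, 4.0])       # user's ratings for items 0..2
ages_days = np.array([1.0, 200.0, 30.0])  # days since each rating was given

# Exponential time decay with a hypothetical half-life of 90 days:
# a rating loses half its weight every 90 days.
decay = 0.5 ** (ages_days / 90.0)
weights = ratings * decay

# User vector = decayed-rating-weighted average of the rated items' vectors.
user_vec = (weights[:, None] * item_vecs).sum(axis=0) / weights.sum()
```

Dropping the decay (or the ratings) from the weights gives you the simpler aggregation variants mentioned above, so you can start plain and add terms as your service requires.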
My personal point of view is that the Python scikit-learn library, which you've seen in the first week of this course, provides a really good classification. There are metrics intended for real-valued vector spaces, such as the Euclidean, Manhattan, and Chebyshev distances. There are metrics intended for integer-valued vector spaces, for instance the Hamming distance, which counts the number of non-equal vector components. And there are metrics intended for boolean-valued vector spaces; my favorites are the Dice and Jaccard distances. Of course, if you are not able to find something that suits your needs, you can implement your own distance metric for your application. Overall, you can use the following diagram as a reference to make sure that you forget nothing important while building a content-based recommender system.

What are the benefits and drawbacks of this approach? Compared to non-personalized recommender systems, this type of recommendation is already personal. If you have a user or item representation, then you overcome the cold start problem. But please bear in mind: if you have an item description and build a user's representation by averaging item feature vectors, then you only overcome the item cold start problem. If you have a new user, then you don't have any ratings; therefore you are not able to average anything and build a user's representation. If you don't use complex embeddings, then by looking at a user's profile you can describe his or her interests.

A content-based recommender system cannot easily tackle interdependencies. Imagine you have several attributes such as: movie language is English; movie language is A, which is your native language; movie genre is cartoons; and movie genre is documentary. If you have just started to learn English, you may like to watch cartoons in English. You are also a fan of documentary movies, but due to the language complexity you prefer to watch them in your native language. So you have high scores for all of these attributes.
And this standalone content-based recommender system will recommend to you cartoons in language A and documentaries in English. It is not something you expect. There are several ways to deal with this kind of problem. The first approach is to use more complex metrics which account for interdependencies, but then your feature space can expand quadratically. The second, more generic approach is to use collaborative filtering algorithms. How to build them, watch the next videos. You are warmly welcome, and don't forget to provide your feedback to make these videos better.

Overall, in this video you have learned how to build users' representations in the items' feature space, or vice versa; what distance metrics are available for different scenarios; and how you can reason about the benefits and drawbacks of a content-based recommender system and analyze whether it is appropriate for your use case.
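As a footnote to the interdependency example: pairwise feature crossing makes "cartoon AND English" its own coordinate, which fixes the mismatch but illustrates the quadratic blow-up. A sketch using scikit-learn's PolynomialFeatures, with made-up one-hot attributes:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One-hot item attributes: [lang_en, lang_A, genre_cartoon, genre_documentary]
items = np.array([
    [1, 0, 1, 0],   # cartoon in English
    [0, 1, 0, 1],   # documentary in language A
    [0, 1, 1, 0],   # cartoon in language A
])

# degree=2, interaction_only=True adds a product column for every attribute
# pair, so a user can like the combination "cartoon AND English" specifically.
crossed = PolynomialFeatures(degree=2, interaction_only=True,
                             include_bias=False).fit_transform(items)
# 4 original features -> 4 + C(4, 2) = 10 features: quadratic growth in n.
```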