Searching Similar Books Based on Student’s Preference for Personalized Education

Abstract: Personalized education aims to give students a personalized learning schedule according to their backgrounds and preferences, so the learning resources each student requires are personalized as well. On-line bookstores allow students to collect learning resources on-line through the Internet


Introduction
Personalized education aims to give students a personalized learning schedule according to their backgrounds and preferences [1,2]. Every student has individual preferences when collecting learning resources, so the required learning resources are personalized even for students from the same major. For example, one student may want a list of books relevant to the topic "data mining", while another may prefer a list relevant to "software engineering". Many students search for learning resources related to their own personalized education processes, and it is difficult to find enough learning resources to satisfy all of these personalized preferences.
The rapid development of the Internet makes collecting learning resources more convenient for students, as billions of learning resources are available online. On-line bookstores allow students to collect learning resources through the Internet at home or elsewhere, which transcends the barriers of geography and makes the study process easier. Through on-line bookstores, students can get many kinds of learning resources, including books and audio and video resources. Today, on-line bookstores such as Amazon (https://www.amazon.com/) and China-pub (http://www.chinapub.com/) have attracted millions of students and provided them with a large amount of valuable learning resources. As the data of on-line bookstores becomes diverse and massive, the problem of information overload arises, since it is difficult to find suitable books for learning.
Similarity search is a promising way to address information overload efficiently, since it can find the objects similar to a given object in a large dataset. Similarity measurement is the core task of the similarity search problem, and existing measures can be divided into two broad categories: 1) content-based similarity measures, which treat each object as a bag of items or as a vector of word weights [3][4][5][6][7]; and 2) link-based (structural) similarity measures, which consider object-to-object relationships expressed in terms of links [8][9][10][11][12][13][14]. Compared to content-based similarity measures, link-based similarity measures produce systematically better correlation with human judgements [15]. When link-based similarity measures are applied in an on-line bookstore, students can find similar books by providing a book as the query, which simplifies the collection of learning resources over large-scale on-line resources. However, the existing similarity search framework limits the input query to one object, so the student can choose only one book as the query. The query intentions of students cannot be thoroughly expressed, since it is difficult for students to choose a single query object that suits their personalized preferences.
In this paper, we study the similarity search problem in on-line bookstores and propose a personalized similarity search framework for finding similar books based on students' preferences for personalized education. To satisfy students' preferences, we allow a student to express the query with multiple books. Based on students' ratings of books, we build the student-book network and compute the similarities between books over this network. We define a personalized similarity measure for the similarity between a query and a candidate book by combining the similarities between books. Experiments on the Amazon dataset demonstrate that, when the number of input books is not limited to one, the returned rankings are more consistent with students' query intentions.
The rest of this paper is organized as follows. Section 2 defines the student-book network and discusses the similarity measure between books. Section 3 gives the personalized similarity measure and describes the personalized similarity search framework. Experimental studies are reported in Section 4. Section 5 discusses the related work. Section 6 concludes this paper and discusses future work.

Similarity Between Books
To prepare for the discussion of personalized similarity, this section first defines the student-book network and then discusses the similarity measure between books based on this network.

Student-Book Network
In the data of an on-line bookstore system, there are many objects of different types, including books and the categories and attributes of these books, and the relationships between these objects are diverse and complex. Among these objects, the students and books, together with the "rating" relationship between them, are the most informative for measuring similarities, since the task of our research is mainly to find the books similar to a student's preferences.
The "rating" relationship means that a student has rated a book. Based on the "rating" relationship, we next define the student-book network. Definition 1 (Student-book network): A student-book network is a bipartite graph $G = (S \cup B, E)$, where $S$ and $B$ are the sets of nodes of student type and book type respectively, and $E$ is the set of links of "rating" type between students and books, i.e., $\forall (u, v) \in E: u \in S, v \in B$.
Usually, a student prefers a book if he or she rated it with a high score. So the nodes of student and book types, together with the "rating" relationship between them, are informative for measuring similarities between books, which is the basis for finding the books similar to students' preferences.
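As an illustration of Definition 1, the following minimal Python sketch builds the two adjacency maps of a student-book network from rating records. The function name `build_student_book_network`, the `(student, book, score)` record layout, and the toy ratings are assumptions made for this example; they are not taken from the paper's implementation or dataset.

```python
from collections import defaultdict


def build_student_book_network(ratings):
    """Build the bipartite graph from (student, book, score) triples.

    Returns two adjacency maps: the books rated by each student and the
    students who rated each book. Scores are ignored in this sketch,
    because Definition 1 only records whether a "rating" link exists.
    """
    books_of = defaultdict(set)      # student -> set of rated books
    students_of = defaultdict(set)   # book -> set of students who rated it
    for student, book, _score in ratings:
        books_of[student].add(book)
        students_of[book].add(student)
    return dict(books_of), dict(students_of)


# Hypothetical toy ratings, not taken from the Amazon dataset.
ratings = [
    ("s1", "data_mining", 5),
    ("s1", "machine_learning", 4),
    ("s2", "data_mining", 4),
    ("s2", "software_engineering", 5),
]
books_of, students_of = build_student_book_network(ratings)
edge_count = sum(len(books) for books in books_of.values())
```

Keeping both directions of the adjacency makes it cheap to look up the neighbors of either a student or a book, which is what the similarity computation needs.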

Similarity Between Books
There are many existing link-based similarity measures in recent work, including SimRank [8], SimFusion [9], P-Rank [10], PathSim [12] and NetSim [14]. Among them, SimRank is a promising solution for measuring the similarities between books in the student-book network. The intuition behind SimRank is that "two nodes are similar if they are referenced by similar nodes", which conforms to our basic understanding. Compared to the 1-hop similarity measures [16][17][18], SimRank considers not only direct connections among nodes but also indirect connections, and can therefore find more valuable underlying relationships.
The SimRank similarities can be computed iteratively. At iteration $l$, the similarity between nodes $a$ and $b$ is denoted by $R_l(a, b)$. The iterative computation starts with $R_0(\cdot, \cdot)$, which is initialized as $R_0(a, b) = 1$ if $a = b$, and $R_0(a, b) = 0$ otherwise. For $l = 1, 2, \ldots$, $R_l(a, b)$ is defined as $R_l(a, b) = 1$ if $a = b$, otherwise:
$$R_l(a, b) = \frac{C}{|I(a)||I(b)|} \sum_{x \in I(a)} \sum_{y \in I(b)} R_{l-1}(x, y),$$
where $C$ is the decay factor and $I(a)$ is the set of in-neighbors of $a$; $R_l(a, b) = 0$ if $I(a)$ or $I(b)$ is empty [8]. The time cost for computing the similarities of all node pairs at the $l$-th iteration is $O(l d^2 n^2)$, and the space cost is $O(n^2)$, where $d$ is the average degree and $n$ is the number of nodes of the given graph. The iterative SimRank computation converges very fast, and there is little change in the returned rankings after five iterations [8].
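When SimRank is applied to the student-book network, the intuition behind the similarity can be described as "two books are similar if they are rated by similar students, and two students are similar if they rated similar books". During similarity computation, the similarity between books is computed by accumulating only the similarities between students, and the similarity between students is computed by accumulating only the similarities between books. Thus, the similarity between books $b_1$ and $b_2$ is initialized as $R_0(b_1, b_2) = 1$ if $b_1 = b_2$, and $R_0(b_1, b_2) = 0$ otherwise; and when $l > 0$, $R_l(b_1, b_2)$ is defined as $R_l(b_1, b_2) = 1$ if $b_1 = b_2$, otherwise:
$$R_l(b_1, b_2) = \frac{C}{|I(b_1)||I(b_2)|} \sum_{s_i \in I(b_1)} \sum_{s_j \in I(b_2)} R_{l-1}(s_i, s_j),$$
$$R_l(s_1, s_2) = \frac{C}{|I(s_1)||I(s_2)|} \sum_{b_i \in I(s_1)} \sum_{b_j \in I(s_2)} R_{l-1}(b_i, b_j),$$
where $C$ is the decay factor, $I(b_1)$ is the in-neighbor set of $b_1$, which contains only nodes of student type, and $I(s_1)$ is the in-neighbor set of $s_1$, which contains only nodes of book type.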
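The alternating book/student computation described above can be sketched in Python as follows. This is a naive illustration on a toy graph, not the optimized C++ implementation used in the experiments; the function name `bipartite_simrank`, the dictionary-based data layout, and the toy network are assumptions of this sketch. The decay factor 0.8 and the five iterations follow the values mentioned in the text.

```python
from itertools import product


def bipartite_simrank(students_of, books_of, decay=0.8, iterations=5):
    """Naive iterative SimRank on a student-book network.

    students_of: book -> set of students who rated it
    books_of:    student -> set of books the student rated
    Returns (book_sim, student_sim), dicts keyed by ordered node pairs.
    """
    books, students = list(students_of), list(books_of)
    # Iteration 0: a node is similar only to itself.
    book_sim = {(a, b): float(a == b) for a, b in product(books, books)}
    stud_sim = {(u, v): float(u == v) for u, v in product(students, students)}

    def update(nodes, neighbours, other_sim):
        # Similarity on one side accumulates similarities of the other side.
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            na, nb = neighbours[a], neighbours[b]
            if not na or not nb:
                new[(a, b)] = 0.0
                continue
            total = sum(other_sim[(x, y)] for x in na for y in nb)
            new[(a, b)] = decay * total / (len(na) * len(nb))
        return new

    for _ in range(iterations):
        # The right-hand side is evaluated first, so both updates read
        # the scores of the previous iteration.
        book_sim, stud_sim = (
            update(books, students_of, stud_sim),
            update(students, books_of, book_sim),
        )
    return book_sim, stud_sim


# Hypothetical toy network: b1 and b2 share a rater, b3 does not.
students_of = {"b1": {"s1"}, "b2": {"s1"}, "b3": {"s2"}}
books_of = {"s1": {"b1", "b2"}, "s2": {"b3"}}
book_sim, stud_sim = bipartite_simrank(students_of, books_of)
```

In this toy network, books `b1` and `b2` are rated by the same single student, so their similarity equals the decay factor, while `b3` shares no direct or indirect rater with them and stays at zero.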
The disadvantage of SimRank is its computational cost. As the student-book network becomes large, the computation of SimRank becomes expensive in terms of time and space. Fortunately, there are extensive optimization techniques for SimRank computation in previous work [19][20][21][22][23][24], which significantly reduce the computation cost. For example, in our previous research [24], the time and space cost of the iterative SimRank computation is reduced by 99.83% on average, with an average accuracy loss of 0.02% in NDCG; this technique can be used to optimize the similarity computation in the student-book network. The similarities can be computed in the off-line stage, so they do not affect the response time of query processing.

Personalized Similarity Measure
To support student preferences, we allow students to express their queries with multiple books. Formally, the student preference is defined as follows. Definition 2 (Student preference): The preference of a student is represented by the vector $P = (p_1, p_2, \ldots, p_N)$, where each entry $p_i$ of $P$ is either 0 or 1, and $N$ is the number of books. $p_i = 1$ represents that the current student prefers book $i$ when inputting the query, and $p_i = 0$ represents that the current student does not prefer book $i$.
The student preference is taken as the query. When the query is not limited to one book, the definition of similarity between the query and a book becomes more complex, since the query and the book do not belong to the same type. To model this similarity, we define the similarity between the query and a candidate book by combining the similarities between the candidate book and the preferred books. The similarity between query $P$ and a book is called the personalized similarity.
Based on the personalized similarity, students can express their preferences on different topics by choosing books on those topics. For example, a student can choose some preferred books on "similarity computation" and "recommendation systems" as the query, and the system returns the books similar to this query, which is more personalized than the result obtained when only one book is provided as the query.
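To make the idea of combining book-to-book similarities concrete, the sketch below assumes that the combination is a plain average over the preferred books. This averaging rule, the function name `personalized_similarity`, and the toy similarity values are assumptions made for illustration; they should not be read as the paper's exact definition.

```python
def personalized_similarity(preference, book_sim, candidate):
    """Combine book-to-book similarities for a multi-book query.

    preference: dict book -> 0 or 1 (the vector P of Definition 2)
    book_sim:   dict (book, book) -> precomputed similarity
    candidate:  the book to score against the query
    The averaging rule is an assumption of this sketch.
    """
    preferred = [book for book, flag in preference.items() if flag == 1]
    if not preferred:
        return 0.0
    total = sum(book_sim[(q, candidate)] for q in preferred)
    return total / len(preferred)


# Hypothetical similarity values for three books.
book_sim = {("a", "c"): 0.4, ("b", "c"): 0.2, ("c", "c"): 1.0}
preference = {"a": 1, "b": 1, "c": 0}
score = personalized_similarity(preference, book_sim, "c")
```

With a single preferred book, this average reduces to the ordinary book-to-book similarity, so the one-book query is a special case of the multi-book query.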

Framework of Personalized Similarity Search
The framework of personalized similarity search is shown in Fig. 1. The processes of the off-line and on-line stages are shown below and above the dotted line, respectively. In the off-line stage, the raw data is cleaned by removing unnecessary links and noisy data, and the "rating" relationship between students and books is chosen for building the student-book network. The similarities between books are computed based on the student-book network and stored in a similarity matrix. In the on-line stage, the student inputs some preferred books as the query, and the system transforms these books into the student preference vector. The similarities between the query and the candidate books are computed by combining the similarities between books, and the candidate books are then sorted according to these similarities. Finally, the system returns the top-k most similar books.
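The on-line stage described above can be sketched as follows: the chosen books are turned into a 0/1 preference vector, each candidate is scored against the precomputed similarity matrix, and the top-k candidates are returned. The function name `top_k_similar`, the averaging rule used for scoring, the exclusion of the query books from the results, and the toy matrix are all assumptions of this sketch.

```python
import heapq


def top_k_similar(query_books, book_ids, sim_matrix, k):
    """Return the k candidate books most similar to a multi-book query.

    query_books: books chosen by the student
    book_ids:    list of all books; position i matches row/column i
    sim_matrix:  precomputed book-to-book similarities (off-line stage)
    """
    index = {book: i for i, book in enumerate(book_ids)}
    # Transform the chosen books into the 0/1 preference vector.
    preference = [0] * len(book_ids)
    for book in query_books:
        preference[index[book]] = 1
    n_preferred = sum(preference)
    if n_preferred == 0:
        return []
    scored = []
    for j, candidate in enumerate(book_ids):
        if preference[j]:
            continue  # assumption: query books are not returned as results
        total = sum(p * sim_matrix[i][j] for i, p in enumerate(preference))
        scored.append((total / n_preferred, candidate))
    # Keep only the k highest-scoring candidates, best first.
    return [book for _score, book in heapq.nlargest(k, scored)]


# Hypothetical similarity matrix for four books.
book_ids = ["a", "b", "c", "d"]
sim_matrix = [
    [1.0, 0.5, 0.4, 0.1],
    [0.5, 1.0, 0.2, 0.3],
    [0.4, 0.2, 1.0, 0.6],
    [0.1, 0.3, 0.6, 1.0],
]
result = top_k_similar({"a", "b"}, book_ids, sim_matrix, k=2)
```

Because the similarity matrix is computed off-line, the on-line work in this sketch is limited to one pass over the candidates and a top-k selection.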

Setup
In this section, we compare our proposed personalized similarity measure (PSR) with the SimRank similarity measure (SR). The similarity computation in the off-line stage is sped up via the partial sums function [20]. Following the literature, the decay factor is set to 0.8. Our experiments were conducted on a 2.30 GHz Intel(R) Xeon(R) CPU with 12 GB RAM, running Windows 8. All algorithms were implemented in C++ and compiled using VS 2010.
We use the Amazon dataset [25] to evaluate our approach. It contains 355,601 products with 2,359,584 co-purchase relationships, 36,591 categories, and 42,890 terms appearing more than once. From this dataset, we choose 5,521 users as students and 2,810 books, with 18,901 links of the "rating" relationship and 18,901 links of the "be rated by" relationship.
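The partial-sums idea mentioned above can be illustrated with a short Python sketch of one SimRank iteration. For a fixed node a, the inner sums over the in-neighbors of a are computed once and reused for every node b. This is a simplified illustration of the memoization idea only; the function name `simrank_step_partial_sums` and the data layout are assumptions, and the sketch does not reproduce the C++ implementation used in the experiments or the other optimizations of [20].

```python
def simrank_step_partial_sums(in_nbrs, prev, decay=0.8):
    """One SimRank iteration using partial-sum memoization.

    in_nbrs: node -> set of in-neighbours
    prev:    dict (node, node) -> similarity from the previous iteration
    """
    nodes = list(in_nbrs)
    new = {}
    for a in nodes:
        # partial[y] = sum of prev(x, y) over x in I(a); it is computed
        # once per node a and shared by every node b below.
        partial = {y: sum(prev[(x, y)] for x in in_nbrs[a]) for y in nodes}
        for b in nodes:
            if a == b:
                new[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                new[(a, b)] = 0.0
            else:
                total = sum(partial[y] for y in in_nbrs[b])
                norm = len(in_nbrs[a]) * len(in_nbrs[b])
                new[(a, b)] = decay * total / norm
    return new


# Hypothetical toy graph: student s1 rated books b1 and b2.
in_nbrs = {"b1": {"s1"}, "b2": {"s1"}, "s1": {"b1", "b2"}}
prev = {(u, v): float(u == v) for u in in_nbrs for v in in_nbrs}
step1 = simrank_step_partial_sums(in_nbrs, prev)
```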
We use Normalized Discounted Cumulative Gain (NDCG) [26] to evaluate the effectiveness of the returned rankings. NDCG at position $k$ divides the discounted cumulative gain of the top-$k$ returned books by that of the ideal ordering, where $i$ denotes the rank of book $b_i$ in the returned list, and the relevance of $b_i$ to the query is set as: 2 (highly relevant), 1 (marginally relevant), and 0 (irrelevant). The similarity levels are labeled in a double-blind fashion.
Fig. 2 shows the NDCG values of both PSR and SR for varying $k$. For each algorithm, we use 20 queries to test the effectiveness. For each query, we indicate the expected topic of the returned books. Specifically, for SR, the student is allowed to choose only one book as the query each time; for PSR, the student is allowed to choose multiple books as the query each time. We find that the NDCG increases as $k$ increases and finally becomes stable, because the rankings for different queries become relatively stable as $k$ increases. We also find that the NDCG values of PSR are evidently higher than those of SR at every $k$. Generally, when the number of input books is not limited to one, the returned rankings are more consistent with students' query intentions.
Fig. 3 shows the NDCG values of PSR for varying query size $n$. We choose 10 queries of different sizes to test the influence of query size. Specifically, for each query, we limit the number of input books to $1, 2, \ldots, 10$, respectively, and record the NDCG value for each query. From this figure, we find that the size of the input query does affect the effectiveness of the returned rankings, and that the NDCG values are relatively higher for $n$ in the range of 2 to 6.
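For readers unfamiliar with the metric, the following Python sketch computes NDCG at position k from graded relevance labels (2, 1, 0). It uses one common formulation, with gain equal to the relevance grade and a logarithmic discount of log2(i + 1); the exact gain and discount used in the experiments may differ. The ideal ordering here is obtained by sorting the labels of the returned list itself, which is a further simplifying assumption.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG of the returned order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels of a returned list, in rank order.
perfect = ndcg_at_k([2, 1, 0], k=3)
reversed_order = ndcg_at_k([0, 1, 2], k=3)
```

A ranking that places the highly relevant books first reaches the maximum value, while the same books in reverse order score lower.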

Related Work
Many link-based similarity measures can be used for measuring similarities between books. Given the focus of this paper, we next introduce the similarity measures that are most relevant to the current work.
SimRank [8] is a classical similarity measure proposed by Glen Jeh and Jennifer Widom, which defines the similarities between objects based on the intuition that "two nodes are similar if they are referenced by similar nodes". SimFusion [9] is one of the influential similarity measures for computing link-based similarities in heterogeneous networks, and aims to combine relationships from multiple heterogeneous data sources. SimFusion computes the similarities iteratively over a unified relationship matrix (URM). Compared to SimRank, SimFusion utilizes the relationship types to distinguish link importance, but there is only one type of link in the student-book network defined in our research, which makes SimRank more efficient and suitable for measuring similarities between books. P-Rank [10] enriches SimRank by considering both in-links and out-links, which addresses the "limited information" problem of similarity computation and improves effectiveness. The intuition behind P-Rank is that "two objects are similar if they are referenced by similar objects or they reference similar objects". C-Rank [11] ignores the direction of links when computing similarities; the meetings of both backward and forward directions are exploited for similarity computation in scientific literature databases. Both P-Rank and C-Rank can find more similar objects by considering the meetings of different directions; however, the student-book network is defined as an undirected graph, so distinguishing link directions brings no additional benefit here.
PathSim [12] assesses similarities in heterogeneous networks by utilizing a meta path provided by users, which captures the similarity semantics among peer objects in networks. This measure allows users to measure similarities from different perspectives. HeteSim [13] adopts the spirit of the meta path and can find the objects in a network that are similar to a query object of any type. Both PathSim and HeteSim require users to provide meta paths, and it is difficult for users to choose a suitable meta path, especially when the network schema becomes diverse. NetSim [14] measures the similarities between objects based on the similarities between attributes; the intuition of NetSim is that "similar centers are linked with similar attributes". However, this measure is suitable only for networks with an x-star network schema.
There are also some similarity measures that utilize the 1-hop neighborhood for similarity computation. Co-citation [16] measures the similarity between two papers in a citation network based on the papers that cite both of them. Formally, the similarity between two papers is defined as the number of papers that cite both of them. Bibliographic Coupling [17] defines the similarity as the number of papers cited by both of them. The Jaccard similarity coefficient [18] defines the similarity between two objects as the ratio of the number of their common neighbors to the number of all their neighbors. These approaches use 1-hop neighbors for defining similarities. Compared to SimRank, they do not consider indirect connections, and would therefore miss some similar results when finding similar objects.
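The three 1-hop measures above can be expressed in a few lines of Python over neighbor sets. The function names, the dictionary layout, and the toy citation data are assumptions made for this illustration.

```python
def co_citation(citing, a, b):
    """Number of papers that cite both a and b.

    citing: paper -> set of papers that cite it
    """
    return len(citing[a] & citing[b])


def bibliographic_coupling(cited, a, b):
    """Number of papers cited by both a and b.

    cited: paper -> set of papers it cites
    """
    return len(cited[a] & cited[b])


def jaccard(neighbours, a, b):
    """Common neighbours divided by all neighbours of a and b."""
    union = neighbours[a] | neighbours[b]
    if not union:
        return 0.0
    return len(neighbours[a] & neighbours[b]) / len(union)


# Hypothetical toy citation data.
citing = {"p1": {"x", "y"}, "p2": {"y", "z"}}
cited = {"p1": {"r1", "r2"}, "p2": {"r2"}}
cc = co_citation(citing, "p1", "p2")
bc = bibliographic_coupling(cited, "p1", "p2")
jc = jaccard(citing, "p1", "p2")
```

All three scores depend only on direct neighbors, so two papers without a shared neighbor always score zero, which is the limitation that SimRank's indirect connections address.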
For fast similarity computation, many optimization techniques have been proposed. BlockSimRank [19] reduces the computation cost of SimRank by partitioning the graph into several blocks according to the block structure of the graph data, so that the similarity of each node pair can be efficiently obtained from these blocks. D. Lizorkin and P. Velikhov [20] optimized SimRank based on partial sums, essential node pairs and threshold-sieved similarity. W. Zheng and L. Zou [21] proposed an efficient algorithm for finding the most similar object pairs in large networks. W. Yu and X. Lin [22] developed an incremental SimRank computation algorithm for fast similarity computation in dynamic networks. W. Yu and J. A. McCann [23] modified SimRank to compute the similarities for partial object pairs, which is important for applications where only the similarities of some object pairs are required. M. Zhang and H. Hu [24] proposed WebSim, which reduces the computation cost of similarity search by limiting the number of iterations to two, and uses a partial index to reduce the execution time of on-line query processing. These approaches can be easily applied to the student-book network to speed up the similarity computation between books.

Conclusion and Future Work
This paper introduced a personalized similarity search framework, which aims to find the books similar to a student's preferences for personalized education. In contrast to the traditional similarity search framework, our proposed approach allows students to express queries with any number of books according to their preferences. We integrate the student preference into the similarity computation, and define the personalized similarity measure by decomposing it into the similarities between the candidate book and the preferred books. Through experiments on a real dataset, we conclude that, when the number of input books is not limited to one, the returned rankings are more consistent with students' query intentions.
There are a number of directions for our future work. First, we would like to study the efficiency of on-line query processing, since the time cost of query processing would increase significantly as the student-book network grows large. Second, we want to integrate the prerequisite relations between books into similarity search, in order to find more suitable books for personalized education. The prerequisite relationship can be obtained from the course schedules of some universities or learned from the purchasing behavior in on-line bookstores. Third, we plan to apply our proposed personalized similarity search framework to other real datasets in real applications, including literature search [26,27] and web search [28,29]. Besides the student-book network, our approach can be applied to any dataset with a bipartite network schema, such as product co-purchasing networks [30,31] and bibliographic networks [32].