Random Walk-Based Semantic Annotation for On-demand Printing Products

Abstract: The scale of real-world networks is increasing day by day, which also brings sparsity problems. Platforms usually need to maintain a large amount of product information, and a feasible way to organize it is to add semantic tags. In this article, we aim to solve the problem of semantic annotation of on-demand printing products. Exploiting the good properties of random walks over the global network structure, we address the sparsity problem and propose an efficient algorithm, ProRWR. First, ProRWR processes the text descriptions of printed products with the TF-IDF algorithm and builds a “product-term” bipartite network. Second, it builds a square matrix from the TF-IDF weight matrix, rewrites the random walk equation, and uses the normalized square matrix as the input of the rewritten equation. After the random walk, the terms with the highest convergence probability for each product document are selected as the most relevant feature terms of the product. Extensive experiments were conducted on an Amazon dataset. The results show that the precision and recall of our algorithm are 73.5% and 60%, respectively, indicating that ProRWR discovers latent semantic associations and achieves semantic annotation of on-demand printed products.


Introduction
Affected by the digitalization of the network, the traditional large-scale printing mode can no longer adapt to individualized market demand, since different user groups differ significantly in their demand for different products. On-demand printing refers to pre-printing and selling a reasonable number of products, and then printing supplementary copies in time according to the sales of the products and the needs of the public. However, faced with massive online product information, users hope to find a specific product quickly and correctly, while enterprises hope to print in time according to users' needs. How to obtain the required data efficiently and accurately from massive product data, and thereby improve the performance of information retrieval, is one of the important issues that enterprises must address.
Semantic annotation [1] is an attractive method in machine learning and data mining, which is useful for indexing and organizing the product information. In the on-demand printing platform, semantic annotation is to add text tags to on-demand printing products, which facilitates the retrieval and management of products. The impact of semantic annotation on on-demand printing is mainly reflected in the proximity searches, recommendation, classification, clustering and other aspects of on-demand printing products. For example, by adding semantic tags to products, it is convenient to cluster or classify different categories of products. When users search for products, the system can accurately find related products based on semantic tags.
With the vigorous development of the semantic annotation field, research is deepening and many novel tagging approaches are emerging, notably content-based and model-based semantic annotation methods. Content-based approaches [2][3][4] mainly study how to combine network metadata, user comments, attention, clicks and other information during the annotation stage. In contrast, model-based algorithms [5][6][7][8][9] often use machine learning to solve the problem of semantic annotation. Broadly speaking, machine learning gives machines the ability to learn, and it plays an important role in the identification of human diseases [10], classification of products [11], feature selection [12] and image processing [13].
Aiming to establish a microblog user interest model, [2] combined clustering and classification algorithms to extract user interest tags, and [5] proposed an approach to automatic document annotation with data mining algorithms: classification, clustering and named-entity recognition. [3] used a content-based filtering method and a distance algorithm for a journal recommendation system. [4] applied context information to alleviate the negative impact of data sparsity, used hierarchical relationships among products to mine users' potential preferences, and then modeled users in a specific period of time. [6] presented an automated framework that extracts product adopter information from online reviews and incorporates the extracted information into feature-based matrix decomposition to recommend products more efficiently. [7] employed association rule mining and the Apriori algorithm for product prediction and recommendation. [8] proposed a concept-based automatic semantic annotation method for online BIM product documents. [9] used the K-nearest neighbor algorithm to annotate images by propagating labels from the semantic neighborhood. These methods provide an important reference for semantic tag generation for on-demand printed products.
Through the above analysis, we found that some existing machine learning methods [14][15] only annotate keywords that appear in the document and cannot present terms that do not appear in it. Among various annotation measures, random walk with restart (RWR) [16][17] provides useful node-to-node relevance scores by considering the global network structure [18] and intricate edge relationships [19], which can discover latent semantic relationships between documents and terms. Moreover, RWR is a stable measure that is not susceptible to noise and missing data. Traditional random walk-based algorithms have been applied to community detection [20], link prediction [21], disease detection [22], entity classification [23], image annotation [24] and other tasks. Based on the above observations, this paper studies the semantic annotation of on-demand printing products based on the RWR model.

Overview of Random Walk with Restart
Random walk with restart (RWR) [25] has become an increasingly attractive measure in data mining and internet research [26][27]. It can discover potential entity tags and mine latent semantic relationships by calculating random walk distances, which are defined by relevance scores. The basic idea of RWR is to traverse a graph starting from one vertex or a set of vertices. Compared with traditional approaches to calculating distance on a graph, such as the shortest path method and maximum flow [28], RWR can capture the multi-faceted information between node pairs and obtain the overall structural relationship of the graph [29][30][31][32].
Figure 1 is an example of a traditional random walk model graph, denoted as $G = (V, E)$, where $V$ is the non-empty set of nodes of the same type, $E$ is the set of edges between nodes, and the weights of the edges are the relational weights between node pairs. All relationships between nodes can be mapped into an $n \times n$ matrix $A$ ($n$ is the total number of nodes), whose elements $a_{ij}$ indicate whether node $i$ has a link with node $j$. Eq. (1) defines the relational matrix of Figure 1:
$$a_{ij} = \begin{cases} 1, & (i, j) \in E \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
If there is a relationship between two nodes, $a_{ij}$ is set to 1; otherwise, it is 0. In particular, a node has no relationship with itself, so $a_{ii} = 0$. In this way, we generate the 7×7 adjacency matrix shown in Figure 2.
Let $k$ be the maximum iteration step of the random walk and $c \in (0, 1)$ the restart probability. The random walk distance matrix from node $i$ to node $j$ [18] is given by Eq. (2), and its recurrence form is derived as Eq. (3). At any node, the traverser walks to a neighbor node with probability $1 - c$ and jumps to any node in the graph with probability $c$. After each walk, we obtain a probability distribution that characterizes the probability that each node in the graph is visited. This probability distribution is used as the input of the next walk, and the process is iterated. When the maximum iteration step is reached, a stable probability distribution [25] is generated.
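The iteration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the commonly cited recurrence R(k) = (1 - c) * P * R(k-1) + c * I with R(0) = I, where P is the row-normalized adjacency matrix, which may differ in detail from Eq. (3). The function name `rwr` and the 4-node path graph are hypothetical and are not the graph of Figure 1.

```python
def rwr(adjacency, c=0.8, steps=50):
    """Random walk with restart, assuming R <- (1 - c) * P * R + c * I."""
    n = len(adjacency)
    # Row-normalize the adjacency matrix so each row is a distribution over neighbours.
    transition = []
    for row in adjacency:
        total = sum(row)
        transition.append([x / total if total else 0.0 for x in row])
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    scores = [row[:] for row in identity]  # R(0) = I: every walker starts at its own node
    for _ in range(steps):
        scores = [
            [
                (1 - c) * sum(transition[i][m] * scores[m][j] for m in range(n))
                + c * identity[i][j]
                for j in range(n)
            ]
            for i in range(n)
        ]
    return scores


# Hypothetical undirected path graph 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = rwr(adjacency, c=0.8, steps=50)
```

In this sketch each row of `scores` stays a probability distribution, and the relevance of a node to the starting node decreases with its distance along the path, which is the behaviour the relevance scores are meant to capture.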

Bipartite Network Construction
The crucial question of product annotation on an on-demand printing platform is how to construct a graph that reflects the relevance between products and tags. The on-demand printing products and their descriptive terms are regarded as the nodes of the graph, and the relationships between them as its edges, which yields the "product-term" bipartite graph. Figure 3 is a partial example of such a bipartite network, denoted as $G' = (V', E')$, where $V' = P \cup T$ consists of the set of products $P$ and the set of terms $T$ contained in the product descriptions. $E'$ is the set of edges between products and terms, and the edge weights are the relevance scores between node pairs. From Figure 3, we can find that: (1) A product is linked only with the terms that its document contains, and these terms have great potential relevance to the product. (2) A term that is associated with two products indicates that both products can be expressed by that term, or that the two products may have similar features. (3) Starting from a product, a term node is reached after an odd number of steps (for example, three steps), whereas a product node is reached after an even number of steps.
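The three observations can be checked on a toy example. The sketch below is a hedged illustration: the product identifiers, the descriptions, and the helper names `build_bipartite` and `hop_counts` are all hypothetical, and whitespace splitting stands in for whatever tokenization the real pipeline uses.

```python
from collections import deque

# Hypothetical toy product descriptions (not taken from the Amazon dataset).
descriptions = {
    "p1": "red cotton shirt",
    "p2": "cotton canvas bag",
    "p3": "canvas poster print",
}


def build_bipartite(descriptions):
    """Return an undirected adjacency map linking each product to its terms."""
    graph = {}
    for product, text in descriptions.items():
        for term in text.split():
            graph.setdefault(product, set()).add(term)
            graph.setdefault(term, set()).add(product)
    return graph


def hop_counts(graph, start):
    """Breadth-first search: minimum number of steps from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


graph = build_bipartite(descriptions)
dist = hop_counts(graph, "p1")
```

Here "canvas" does not occur in the description of "p1", yet it is reached in three steps through the shared term "cotton" and the product "p2". This is the kind of latent product-term association that a walk on the bipartite network can surface.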

Product-Term Weight Computation
TF-IDF is a statistical method and a commonly used term weighting method for information retrieval, which is usually used to assign term weights for a class of documents and to evaluate the importance of a term to a document in a corpus. The importance of a term increases proportionally with the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
The term frequency (TF) refers to the number of times a term appears in a document. The inverse document frequency (IDF) grows as the number of documents containing the term shrinks, so a large IDF indicates that the term distinguishes categories well. The main idea of TF-IDF is that if a term or phrase appears in a document with a high TF and rarely appears in other documents, then it is considered to have good ability to distinguish categories and is suitable for classification [33]. TF-IDF is the product of TF and IDF: the larger the product, the more the term reflects the subject of the document. In [34], a keyword extraction algorithm based on TF-IDF is proposed, which combines the semantics and statistical weight of terms to extract keywords. [35] applied singular value decomposition to the feature vectors generated by the TF-IDF algorithm and carried out sentiment analysis of microblogs combined with LSA. Based on the above discussion, the TF-IDF algorithm is generally applicable to the extraction of feature terms.
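In its standard form, TF-IDF is calculated as follows:
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k=1}^{m_i} n_{i,k}} \tag{4}$$
$$\mathrm{IDF}_{j} = \log \frac{|D|}{|\{i : n_{i,j} > 0\}|} \tag{5}$$
$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{j} \tag{6}$$
where $n_{i,j}$ is the number of times term $t_j$ appears in the description of product $p_i$; $\sum_{k=1}^{m_i} n_{i,k}$ is the sum of occurrences of all terms in the description text of product $p_i$; $m_i$ represents the number of terms contained in product $p_i$; $|D|$ is the total number of products in the corpus; and $|\{i : n_{i,j} > 0\}|$ is the number of products whose description contains $t_j$.
Next, we construct the "product-term" weight matrix $A_{PT}$, as shown in Figure 4. Each product description is represented by a row vector, and each column corresponds to a term. By Eqs. (4)-(6), the $\mathrm{TF\text{-}IDF}_{i,j}$ value of each matrix element can be obtained, which represents the relevance between product $p_i$ and term $t_j$.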
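As an illustration of how such a weight matrix could be filled, the following sketch computes TF-IDF for a list of tokenized descriptions. It assumes the plain, unsmoothed IDF (logarithm of the corpus size over the document frequency) and non-empty descriptions; the function name `tfidf_matrix` and the toy documents are hypothetical.

```python
import math


def tfidf_matrix(docs):
    """docs: list of non-empty token lists. Returns (terms, weights), weights[i][j] = TF * IDF."""
    terms = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of descriptions that contain each term.
    doc_freq = {t: sum(1 for doc in docs if t in doc) for t in terms}
    weights = []
    for doc in docs:
        row = []
        for t in terms:
            tf = doc.count(t) / len(doc)          # occurrences of t over all term occurrences
            idf = math.log(n_docs / doc_freq[t])  # plain IDF, no smoothing (assumption)
            row.append(tf * idf)
        weights.append(row)
    return terms, weights


# Hypothetical toy corpus of three product descriptions.
docs = [
    ["red", "cotton", "shirt"],
    ["cotton", "canvas", "bag"],
    ["canvas", "poster", "print"],
]
terms, weights = tfidf_matrix(docs)
```

Each row of `weights` corresponds to one product and each column to one term, mirroring the layout described for the weight matrix. A term that occurs in only one description ("red") receives a higher weight than one shared by two descriptions ("cotton").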

Initial Transition Probability Matrix Construction
From Eq. (2), it can be seen that the basic idea of RWR is matrix multiplication, which involves repeated transposition of the matrix and leads to higher computational complexity. Therefore, in this section, we build a square matrix $A$ to simplify the operation, as shown in Figure 5. The square matrix $A$ can be regarded as a block matrix consisting of $A_{PT}$, $A_{TP}$ and two zero matrices, where $A_{TP}$ is the term-to-product matrix obtained by transposing $A_{PT}$. It is worth noting that the two zero matrices are placed on the main diagonal because relationships between products and products, and between terms and terms, are not considered. Next, for the convenience of calculation, $A$ is normalized by rows:
$$\tilde{a}_{ij} = \frac{w_{ij}}{\sum_{l} w_{il}} \tag{7}$$
where $\tilde{a}_{ij}$, an element of the normalized matrix $\tilde{A}$, represents the initial transition probability from product $i$ to term $j$, and $w_{il}$ represents the TF-IDF weight of the $l$-th term in the row vector of product $i$. For each product document, the sum of the TF-IDF values of all terms is the denominator and the TF-IDF weight of each term in the product is the numerator. The resulting values are the elements of $\tilde{A}$. Since the main diagonal consists of two zero matrices, normalizing the square matrix $A$ amounts to normalizing the $A_{PT}$ and $A_{TP}$ matrices.
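The block construction and row normalization can be sketched as follows. The function name `build_transition` and the 2×2 weight matrix are hypothetical, and the node ordering (products first, then terms) is an assumption made for the illustration.

```python
def build_transition(weights):
    """weights: m x n product-term matrix. Returns the row-normalized (m+n) x (m+n) block matrix."""
    m, n = len(weights), len(weights[0])
    size = m + n
    square = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            square[i][m + j] = weights[i][j]  # upper-right block: product -> term
            square[m + j][i] = weights[i][j]  # lower-left block: term -> product (transpose)
    # The diagonal blocks stay zero: no product-product or term-term edges.
    normalized = []
    for row in square:
        total = sum(row)
        normalized.append([x / total if total > 0 else 0.0 for x in row])
    return normalized


# Hypothetical weights for 2 products (rows) and 2 terms (columns).
transition = build_transition([[1.0, 3.0], [0.0, 2.0]])
```

Because the diagonal blocks are zero, dividing each row of the square matrix by its sum is the same as normalizing the product-term block and the term-product block separately.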

ProRWR Algorithm
In this section, we first rewrite the RWR equation into a simpler and more efficient form, given in Eq. (8), where $\tilde{A}_{PT}$ and $\tilde{A}_{TP}$ are the normalized forms of $A_{PT}$ and $A_{TP}$, respectively. The case $k = 1$ is given in Eq. (9). To further improve the accuracy of product annotation, we next introduce an improved algorithm, ProRWR, shown in Algorithm 1. Before performing the random walk, we construct the square matrix in lines 1-4 and use $\tilde{A}$ as the initial transition probability matrix of ProRWR. Starting from a product node, the random walk is performed on the "product-term" bipartite network in lines 5-9. The speed of iterative convergence is determined by the restart probability $c$.
The relevance score between node $i$ and node $j$ is defined as the steady-state probability that the particle stays at node $j$ after $k$ steps. The larger the convergence probability at the end of the walk, the more representative the term is of the product. After the iteration, we obtain the convergence probability matrix. The matrix needed is the $A_{PT}$ block in the upper right corner of the square matrix, whose elements are the convergence probabilities between products and terms, that is, the latent relevance scores. By sorting these probability values, the top-$R$ terms with the largest convergence probability are recommended as the tags of the product, which realizes the semantic annotation of the product.
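The walk and the tag selection can be sketched as below. This is not Algorithm 1 itself: it assumes the standard recurrence R(k) = (1 - c) * A~ * R(k-1) + c * I with R(0) = I, and the names `pro_rwr`, `top_terms`, `top_n` and the toy transition matrix (2 products followed by 3 terms) are hypothetical.

```python
def pro_rwr(transition, c=0.8, steps=3):
    """Iterate the assumed recurrence R <- (1 - c) * A~ * R + c * I, starting from R = I."""
    size = len(transition)
    scores = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        scores = [
            [
                (1 - c) * sum(transition[i][m] * scores[m][j] for m in range(size))
                + (c if i == j else 0.0)
                for j in range(size)
            ]
            for i in range(size)
        ]
    return scores


def top_terms(scores, n_products, terms, top_n):
    """Rank terms for each product by the upper-right (product -> term) block of scores."""
    result = []
    for i in range(n_products):
        block = scores[i][n_products:]
        ranked = sorted(range(len(terms)), key=lambda j: block[j], reverse=True)
        result.append([terms[j] for j in ranked[:top_n]])
    return result


# Hypothetical normalized matrix, node order: p0, p1, t0, t1, t2.
transition = [
    [0.0, 0.0, 0.5, 0.5, 0.0],  # p0 contains t0 and t1
    [0.0, 0.0, 0.0, 0.5, 0.5],  # p1 contains t1 and t2
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
]
after_one = pro_rwr(transition, c=0.8, steps=1)
after_three = pro_rwr(transition, c=0.8, steps=3)
tags = top_terms(after_three, 2, ["t0", "t1", "t2"], top_n=3)
```

In this toy run, after one step the score of p0 for t2 is still zero because t2 does not occur in p0. After three steps it becomes positive through the path p0, t1, p1, t2, so a term absent from the description can still be ranked.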

Setup
The algorithm is implemented in Java using the Eclipse development tool with JDK 1.8.0 and JRE 1.8.0, and runs on a computer with a Windows Server 2012 (a) system environment. The computer has 128 GB of memory and a 2-core CPU with a frequency of 1.70 GHz per core.
The experiment was conducted on the Amazon dataset (http://snap.stanford.edu/data/amazon-meta.html). We extract the Amazon Standard Identification Number (ASIN) and title fields from the dataset. Each ASIN corresponds to a product, and the title is a basic text description of the product. The experiment selected the products with IDs from 1 to 50,000 for analysis and processing.

Evaluation Criteria
To evaluate the effect of semantic annotation, we use precision and recall, which are commonly used evaluation indicators for information retrieval and recommendation systems, to assess the quality of the results.

Precision
Precision measures the proportion of elements in the recommended set $R_i$ that appear in the verification set $T_i$, that is, how many feature terms are retrieved accurately. The precision is calculated as follows:
$$\mathrm{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|R_i \cap T_i|}{|R_i|} \tag{10}$$
where $|R_i \cap T_i|$ is the number of retrieved terms related to product $i$; $|R_i|$ is the number of terms chosen to recommend for product $i$, denoted as $R$; $|T_i|$ is the number of all terms contained in the product documents; and $n$ is the number of on-demand printed products selected for the experiment ($n = 50000$). The final precision is the average of the precision over all products.

Recall
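The recall rate characterizes the proportion of the verification set $T_i$ that is covered by the recommended feature term set $R_i$. The recall rate is calculated as shown in Eq. (11), and the final recall rate is the average of the recall rates of all $n$ products:
$$\mathrm{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|R_i \cap T_i|}{|T_i|} \tag{11}$$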
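Both measures can be computed per product and then averaged, as in the following sketch. The function name `precision_recall` and the toy term sets are hypothetical, and products with an empty set are scored as 0 by assumption.

```python
def precision_recall(recommended, reference):
    """recommended, reference: one collection of terms per product. Returns macro averages."""
    precisions, recalls = [], []
    for rec, ref in zip(recommended, reference):
        hits = len(set(rec) & set(ref))  # terms that are both recommended and in the verification set
        precisions.append(hits / len(rec) if rec else 0.0)
        recalls.append(hits / len(ref) if ref else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical recommendations and verification sets for two products.
recommended = [["a", "b"], ["c", "d"]]
reference = [["a", "x", "y", "z"], ["c", "d"]]
precision, recall = precision_recall(recommended, reference)
```

For the first product one of two recommended terms is correct (precision 0.5) and one of four reference terms is found (recall 0.25). The second product scores 1.0 on both, giving averages of 0.75 and 0.625.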

Experimental Results and Discussion
In Figure 3, taking a product node as the starting node of the random walk, a term node can only be reached after an odd number of steps, and a product node is reached after an even number of steps. Therefore, to achieve semantic annotation of printed products, only odd steps can be used to recommend feature terms for a product. In this section, we discuss the setting of the restart parameter and the influence of the number of selected feature terms on the performance of the ProRWR algorithm.

Effect of Restart Probability
The ProRWR algorithm converges during iteration, and the rate of convergence is determined by the restart probability $c$. Figure 6 shows the precision for varying $c$. Precision increases slowly and then decreases slowly as $c$ increases, and the algorithm is most effective when $c = 0.8$. The closer $c$ is to 0, the more the random walk reflects the global structure of the network; the closer $c$ is to 1, the more it reflects the local structure around the starting point.
Figure 7 reports the precision and recall for varying iteration step $k$, where $c = 0.8$. When the maximum iteration step is reached, the probability that each term node is visited tends to a stable value, which suggests the convergence of the ProRWR algorithm. In this article, the steady-state probability is used as the basis for recommending feature terms.

Effect of Iteration Step
Compared with the TF-IDF value, the probability value at iteration 1 is equal to the TF-IDF value multiplied by $1 - c$, where $c$ is the restart probability, so the precision and recall are both lower. Precision and recall increase as $k$ increases from 1 to 3. Beyond that, as the iteration step increases further, there is no obvious change in precision and recall. Therefore, the experiment finally uses the results at iteration 3 for semantic annotation of on-demand printed products, which gives high accuracy and much better annotation performance.
Figure 8 shows the relationship between precision and recall when $R$ takes different values, where $k = 3$ and $c = 0.8$. The experiment takes $R \in [1, 10]$. Usually, we hope that both the precision and the recall of the retrieval results are as high as possible, but in fact the two conflict in some cases. For example, in Figure 8, if only one term is selected, precision is higher but recall is very low. If all results are returned, recall is higher and precision is lower. Therefore, when the number of recommended tags is 4, we obtain 73.5% precision and 60% recall.

Case Study
The ProRWR algorithm is proposed to solve the problem that users cannot reasonably choose products because of the large number of on-demand printing products. Experimental results show that when the iteration step is 3 and the restart probability is 0.8, the proposed algorithm ensures better precision and recall, and thus maintains high precision of product annotation. Table 1 gives 3 examples of using this algorithm to annotate on-demand printed products in the Amazon dataset, from which we found that most tags in the returned lists are highly relevant to the products. For example, for the product title "ballroom dancing", the tags recommended by our algorithm are "dancing, ballroom, latin, fingerboard, slow, teach, kick, session, midnight, stars", which are related to "ballroom" and "dance" even when they share no term with the product's title. This indicates that ProRWR can discover latent terms and remove irrelevant terms. Similar cases can be found for other products, supporting the accuracy of our algorithm's semantic annotation. From the feature terms recommended by ProRWR, we can see the main or similar descriptive information of a product, which helps different users find products that meet their needs.

Conclusion and Future Work
This paper proposed ProRWR, an RWR-based algorithm for semantic annotation of on-demand printed products. The method constructs a bipartite network based on the TF-IDF algorithm to represent the relationships between products and terms, and applies RWR to discover the latent semantic associations between them. Experimental results show that the proposed algorithm can solve the problem of semantic annotation on an on-demand printing platform more accurately.
The contributions presented in this paper differ from existing approaches such as keyword extraction, text categorization and TextRank. This RWR-based work has both theoretical and practical implications, owing to the following advantages: (1) RWR is a stable measure that is not susceptible to noise and missing data. (2) It can recursively capture the multi-faceted information between two nodes in the graph. When the data is extremely sparse, it can effectively capture the transition probability between nodes and obtain the overall structural relationship of the graph. (3) RWR can effectively discover potential product tags and reduce the redundancy of tags.
There are still some limitations in our research. (a) ProRWR can only be used on smaller networks due to memory constraints. When the number of network nodes increases sharply, the RWR-based algorithm incurs a large computational cost and cannot handle more documents. (b) Users want instant feedback when they request information, but ProRWR does not take into account the efficiency of real-time recommendation. (c) ProRWR focuses only on static networks. When the network changes, ProRWR needs to recalculate the relevance matrix, which has high complexity and leads to huge computational and storage challenges.
In future work, the following aspects need to be studied to overcome the above limitations. First, we will explore how to improve the ability of the RWR algorithm to process large-scale data in order to cope with the expanding scale of social networks. Second, we need to speed up user queries to meet users' real-time response needs. In addition, research on dynamic graphs should also be considered. In future research, the method in this paper can also be applied to commodity recommendation on e-commerce platforms, video push, image annotation, intelligent question answering, and so on.