Topic Characteristics of Large-Scale Online Public Opinion Based on Coword Networks and Event-Driven Methods

Abstract: In June 2019, the Anti-Extradition Law Amendment Bill (Anti-ELAB) Movement occurred in Hong Kong. The movement had a huge impact on online public opinion. Online public opinion of this kind, which lasts a long time and has a wide range of influence, is often called large-scale online public opinion. Research on large-scale online public opinion is scarce, and the perspectives taken are limited. To study this kind of public opinion from the topic perspective, this study investigated topic evolution and spatiotemporal characteristics using coword networks and event-driven methods. The proposed methods were applied to a case study based on the corpus related to the Anti-ELAB Movement on Sina Weibo. The results revealed public opinion hotness trends and their influencing factors, as well as the topic content, evolution characteristics, and spatiotemporal characteristics of the three evolution stages of the Anti-ELAB Movement. They also revealed the guiding role of events in topic content and evolution and showed that the topic's spatiotemporal hotspots are clustered. Throughout the course of large-scale online public opinion, the content of online public opinion changes with the secondary events, and the spatiotemporal topic hotspots are also related to the events.


Introduction
Social media has increasingly become a venue for the expression of public opinion. As a result, public opinion has developed into online public opinion. By recording a huge number of comments on large-scale public opinion events, social media sites end up storing a massive corpus of text. A large-scale public opinion event is an event with a large time span, a wide spatial influence, and many secondary events. Online public opinion corresponding to a large-scale public opinion event is called large-scale online public opinion. The Hong Kong Anti-ELAB (Anti-Extradition Law Amendment Bill) Movement that began in June 2019 is a typical large-scale public opinion event. It triggered extensive discussions on social media platforms, and substantial public opinion data were recorded. Topic characteristics and evolution patterns extracted from public opinion data can guide public opinion monitoring. Therefore, investigating such phenomena has great significance for large-scale online public opinion research as well as public security. Research on the topic evolution of public opinion mainly uses methods such as clustering, topic models, and coword networks. For example, Jayashri et al. proposed topic clustering and topic discovery algorithms based on time parameters [1]. David et al. used a latent Dirichlet allocation (LDA) topic model to model Twitter tweets [2], while Ni et al. proposed an LDA-affinity propagation (LDA-AP) topic evolution model [3] that combined clustering- and topic-based methods. Li et al. proposed a topic evolution research framework that combined coword networks and community discovery [4].
The abovementioned methods nevertheless have some limitations. The clustering-based method is random, and its introduction of interference information affects its accuracy. The topic model-based method must determine the number of topics in advance. However, large-scale online public opinion spans a large number of topics, which means the topic-model method will inevitably miss topics. Further, while the coword-network method can present a scientific cognitive structure [5], its algorithm still needs improvement. Specifically, it can be improved in two areas: First, most researchers use word frequency or subjective judgment to extract keywords [6][7][8], which is subjective and lacks accuracy in calculating keyword weights. Second, when detecting topic evolution, the method sets a threshold [9]. We aimed to use a topic detection method that had no threshold so that the weak connections between topics can be maintained as much as possible. Finally, regardless of the method that is used, most research in this area lacks a framework for topic analysis, and the investigations are superficial.
The Sina Weibo tweet data about the Anti-ELAB event used in this article were obtained manually. They contain only the tweets posted by users and no other sensitive information such as user names and addresses. According to the Sina Weibo service agreement (https://weibo.com/signup/v5/protocol), users are allowed to browse the Sina Weibo web page and manually retrieve tweet data. ITF/PDF (integrated term frequency/proportional document frequency) [10] was then used to extract text keywords. Finally, the coword-network method was used for topic detection and topic evolution detection, with the threshold removed during topic evolution detection to retain the subtle associations between topics as much as possible. Large-scale online public opinion involves many topics, many secondary events, and a large time span. Therefore, an event-driven topic-analysis method was used to interpret topic characteristics in depth. Using these methods, this study aimed to explore the evolutionary patterns of large-scale public opinion events and provide relevant organizations with a theoretical basis for public-opinion response measures.

Technical Path
As shown in Figure 1, the technical path is divided into two stages. The first stage uses an improved method based on the coword network to extract topics; the second stage analyzes the characteristics of the extracted topics, including topic content evolution and spatiotemporal characteristics.

Topic Discovery Based on the Coword Network
Widely used in the field of text mining, coword networks can effectively discover and visualize public opinion topics [11]. This method includes three processes: topic word screening, coword network construction, and topic community detection.

Topic Word Screening
This study used word frequency along with the ITF/PDF method to extract keywords. ITF/PDF is a simple and effective method for extracting keywords from multiple documents. In the ITF/PDF formula [10], $\mathrm{Weight}_i$ is the weight of word $i$, $N$ is the number of documents contained in the document set, $n_i$ is the number of documents containing word $i$, $K$ is the number of words contained in the $j$th document, and $tf_{ij}$ is the frequency of word $i$ in document $j$. First, initial screening was performed based on word frequency. Then, ITF/PDF was used to perform secondary screening on the initial screening results. Keyword weight is expressed by the ITF/PDF score.

Coword Network Construction
A coword network is a network structure constructed from the cooccurrence relationships between words in documents. Figure 2 shows the steps for constructing a coword network. Assume there are six documents involving a total of six words, where $d_i$ is the $i$th document and $w_j$ is the $j$th word. If the words $w_1$ and $w_2$ both appear in document $d_i$, then in the keyword-document matrix the cells in rows $w_1$ and $w_2$ of column $d_i$ are both 1, and the words $w_1$ and $w_2$ cooccur in $d_i$. According to cooccurrence frequency (cooccurrence in one document is counted once), the keyword-document matrix can be converted into a keyword cooccurrence matrix. Finally, a keyword cooccurrence network is constructed from the keyword cooccurrence matrix, where nodes are keywords and edges are the cooccurrence relationships among words. If the words $w_1$ and $w_2$ cooccur, their nodes are linked, and the weight of the edge is the number of cooccurrences.
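The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_coword_network` and the toy documents are hypothetical, and each document is assumed to be already reduced to its screened keywords.

```python
from collections import Counter
from itertools import combinations

def build_coword_network(documents):
    """Return {(word_a, word_b): cooccurrence count} from keyword lists."""
    edges = Counter()
    for keywords in documents:
        # A pair is counted once per document, even if a word repeats in it.
        for pair in combinations(sorted(set(keywords)), 2):
            edges[pair] += 1
    return edges

# Hypothetical toy corpus (not data from the study).
docs = [
    ["w1", "w2", "w3"],
    ["w1", "w2"],
    ["w2", "w3", "w4"],
    ["w4", "w5"],
]
network = build_coword_network(docs)
for (a, b), weight in sorted(network.items()):
    print(a, b, weight)
```

Sorting each pair keeps the edge keys undirected, so (w1, w2) and (w2, w1) are never stored as separate edges.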

Topic Community Detection
If words are linked more closely in the network, the topics expressed in the text corresponding to those words are more similar. Based on this principle, words are divided into communities, and the topics expressed by the words represent those expressed in the corresponding text. This study used the Louvain algorithm [12][13][14] to divide coword network communities. The Louvain algorithm is a greedy modularity optimization algorithm that involves modularity $Q$ and modularity gain $\Delta Q$. The modularity is calculated as follows:

$$Q = \frac{1}{2m} \sum_{w_1, w_2} \left[ A_{w_1,w_2} - \frac{k_{w_1} k_{w_2}}{2m} \right] \delta(c_{w_1}, c_{w_2})$$

where $m$ is the total number of edges in the network, $Q$ is the modularity of the network, $A_{w_1,w_2}$ is the weight of the links between nodes (keywords) $w_1$ and $w_2$, $k_{w_1}$ and $k_{w_2}$ represent the sum of the weights of all links to keyword nodes $w_1$ and $w_2$, and $\delta(c_{w_1}, c_{w_2})$ is the Kronecker delta function. When nodes $w_1$ and $w_2$ belong to the same community, $\delta(c_{w_1}, c_{w_2}) = 1$; otherwise, its value is 0. $c_{w_1}$ and $c_{w_2}$ represent the communities to which nodes $w_1$ and $w_2$ belong, respectively. $\Delta Q$ is the modularity increment of the network. In its formula, $n_{v,C}$ is the number of all links between node $v$ and community $C$, $w$ is a point linked to $v$ in community $C$, $D_v$ is the degree of point $v$, and $D_w$ is the degree of point $w$.
The flow of the Louvain algorithm is as follows:
1. Initialize the communities: treat each node in the network as a separate community and number it.
2. Randomly select node i from the network, find all neighbor nodes of i, calculate the modularity increment ∆Q that results from moving node i into each neighbor's community, and move node i into the community with the largest positive modularity increment.
3. Repeat step 2 for all nodes in the network until there is no more modularity gain.
4. Regard each community obtained after node merging as a super node and the links between communities as the edges between super nodes; repeat steps 2 and 3 until there is no more modularity gain.
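To make the quantity that the Louvain algorithm optimizes concrete, the following sketch evaluates the standard modularity definition for a given partition and obtains a modularity change by comparing two partitions. It is only an illustration under assumptions: it does not implement the full Louvain procedure, and the function name `modularity`, the toy graph, and the community labels are hypothetical.

```python
from itertools import product

def modularity(edges, community):
    """Modularity Q of a partition.

    edges: {(u, v): weight}, each undirected edge listed once.
    community: {node: community label}.
    """
    nodes = sorted(community)
    adj = {}
    for (u, v), w in edges.items():
        # Store both directions so the double sum runs over ordered pairs.
        adj[(u, v)] = adj.get((u, v), 0) + w
        adj[(v, u)] = adj.get((v, u), 0) + w
    k = {n: sum(adj.get((n, o), 0) for o in nodes) for n in nodes}
    two_m = sum(k.values())  # equals 2m
    q = 0.0
    for u, v in product(nodes, repeat=2):
        if community[u] == community[v]:  # Kronecker delta
            q += adj.get((u, v), 0) - k[u] * k[v] / two_m
    return q / two_m

# Hypothetical toy graph: two triangles joined by a single bridge edge.
edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
two_topics = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_topic = {node: 0 for node in two_topics}

q_split = modularity(edges, two_topics)
q_merged = modularity(edges, one_topic)
delta_q = q_split - q_merged  # gain from splitting into two communities
print(round(q_split, 4), round(q_merged, 4), round(delta_q, 4))
```

Recomputing Q for every candidate move is far slower than the incremental ∆Q used in practice; the difference form is used here only because it follows directly from the definition of Q.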

Topic Evolution Detection Based on the Coword Network
A keyword coword network for two periods was constructed to detect topic evolution. As shown in Figure 3, there are three topics, $C_1^{t_1}$, $C_2^{t_1}$, and $C_3^{t_1}$, at stage $t_1$ and three topics, $C_1^{t_2}$, $C_2^{t_2}$, and $C_3^{t_2}$, at stage $t_2$. Words $w_1$ and $w_4$ in one of the stage-$t_1$ topics are linked to keywords in each topic at stage $t_2$, representing keyword cooccurrence between topics of the two stages. The weight of an edge is the number of times the keywords cooccur. The topic evolution between $C_i^{n-1}$ and $C_j^{n}$ is expressed as the proportion of the evolution from topic $C_i^{n-1}$ to topic $C_j^{n}$, where $C_i^{n-1}$ is the $i$th topic at stage $n-1$.

Guoqing Liu and Weihong Li: Topic Characteristics of Large-Scale Online Public Opinion Based on Coword Networks and Event-Driven Methods
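One way to turn cross-stage cooccurrence weights into evolution proportions without a threshold is sketched below. The normalization is an assumption made for illustration, since the formula is not reproduced here: the share from a stage n-1 topic to a stage n topic is taken as the cooccurrence weight between the two topics divided by the total cross-stage weight leaving the earlier topic. The function name and all toy data are hypothetical.

```python
from collections import defaultdict

def evolution_proportions(cross_edges, prev_topic_of, next_topic_of):
    """Share of each earlier topic's cross-stage weight going to each later topic.

    cross_edges: {(earlier_word, later_word): cooccurrence count}
    prev_topic_of / next_topic_of: {word: topic label} for each stage.
    """
    flow = defaultdict(float)
    total = defaultdict(float)
    for (u, v), weight in cross_edges.items():
        src, dst = prev_topic_of[u], next_topic_of[v]
        flow[(src, dst)] += weight
        total[src] += weight
    # No threshold is applied, so weak links keep a small nonzero share.
    return {pair: w / total[pair[0]] for pair, w in flow.items()}

# Hypothetical toy data (assumed normalization, not the study's results).
prev_topic_of = {"w1": "A", "w4": "A", "w2": "B"}
next_topic_of = {"x1": "P", "x2": "Q", "x3": "R"}
cross_edges = {("w1", "x1"): 3, ("w4", "x2"): 1, ("w4", "x3"): 1, ("w2", "x2"): 2}

shares = evolution_proportions(cross_edges, prev_topic_of, next_topic_of)
for (src, dst), share in sorted(shares.items()):
    print(src, "->", dst, round(share, 2))
```

The resulting table of (source topic, target topic, share) rows has the shape needed for an alluvial diagram.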

Discovery of Spatiotemporal Topic Hotspots Based on the Event-Driven Method
Events drive topic generation. As shown in Figure 4, event A occurs at $t_1$, and witnesses publish information online. At $t_2$, the media report on the event. News and reports spread through $\Delta t_2$. Finally, witnesses and media reports trigger the spread of public opinion, and a topic is formed at $t_3$. A topic reflects the content of an event. For large-scale online public opinion, many secondary events occur in the life cycle of public opinion, and secondary events trigger topics. In addition, every event has a location attribute; thus, it is possible to connect time, space, and topics through an event. This study first conducted time-series statistics on topic-related texts to obtain time hotspots. Then, events and their corresponding locations were found based on the time hotspots.
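The two steps, finding time hotspots and then linking them to events, might look like the sketch below. Two choices are assumptions for illustration only: a day counts as a hotspot when its post count exceeds the mean plus one standard deviation, and an event is matched to a hotspot when it occurred at most three days earlier. The function names, dates, counts, and places are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean, pstdev

def time_hotspots(daily_counts):
    """Days whose post count exceeds mean + 1 standard deviation (assumed rule)."""
    values = list(daily_counts.values())
    cutoff = mean(values) + pstdev(values)
    return sorted(day for day, n in daily_counts.items() if n > cutoff)

def events_for_hotspot(hotspot, events, max_lag_days=3):
    """Events that occurred between 0 and max_lag_days before the hotspot date."""
    window = timedelta(days=max_lag_days)
    return [e for e in events if timedelta(0) <= hotspot - e["date"] <= window]

# Hypothetical toy series and event table (not data from the study).
daily_counts = {
    date(2020, 1, 1): 100, date(2020, 1, 2): 110, date(2020, 1, 3): 120,
    date(2020, 1, 4): 105, date(2020, 1, 5): 400, date(2020, 1, 6): 115,
    date(2020, 1, 7): 100,
}
events = [
    {"date": date(2020, 1, 3), "place": "district A"},
    {"date": date(2019, 12, 30), "place": "district B"},
]

hotspots = time_hotspots(daily_counts)
matched = {day: events_for_hotspot(day, events) for day in hotspots}
print(hotspots, matched)
```

Because each matched event carries a place, the hotspot date, the event, and its location are tied together, which is what allows time, space, and topic to be connected.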

Study Area and Data
The tweets we used come from Sina Weibo's data on the Anti-ELAB Movement (6.9.2019-10.23.2019). The tweet text was classified with supervision to remove irrelevant text (such as advertisements and celebrity-related posts), and the remaining text contained 2,501,016 entries. This text was tokenized, and stop words were removed. Custom words such as "Carrie Lam," "antiviolence," "save Hong Kong," and "page link" were added during tokenization. When stop words were removed, "page link," "Weibo video," and "forward" were added to the stop word list to remove noise introduced by the Weibo platform. In addition, a table of events related to the Anti-ELAB Movement was collected from Wen Wei Po (6.9.2019-10.23.2019), which contained the time and place of each event.

Discovery of Public Opinion Regarding the Anti-ELAB Movement
In the life cycle of public opinion, topics are different at different stages. This study used the number of posts to represent public opinion hotness for time-series statistics. Then, Hong Kong's Anti-ELAB Movement was divided into three stages: the incubation period (6.9.2019-7.15.2019), the outbreak period (7.16.2019-8.25.2019), and the recession period (8.26.2019-10.23.2019). As shown in Figure 5, the number of posts during the incubation period fluctuated slightly. During the outbreak period, the number of posts increased rapidly, and public opinion hotness soared. During the recession period, the number of posts showed an oscillating downward trend, and the peaks were related to secondary events. These findings are consistent with the results of previous studies [15]. A comparison showed that the number of posts followed the same trend as the "Hong Kong" Baidu Index.
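The stage division can be expressed as a small lookup over the stage boundaries given above, which also yields each stage's share of posts. This is a sketch under assumptions: the stage dates come from the text, but the function names and the daily counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from datetime import date

# Stage boundaries as stated in the text.
STAGES = [
    ("incubation", date(2019, 6, 9), date(2019, 7, 15)),
    ("outbreak", date(2019, 7, 16), date(2019, 8, 25)),
    ("recession", date(2019, 8, 26), date(2019, 10, 23)),
]

def stage_of(day):
    """Return the stage name for a date, or None if it is outside the study period."""
    for name, start, end in STAGES:
        if start <= day <= end:
            return name
    return None

def stage_shares(daily_counts):
    """Fraction of all in-period posts that falls into each stage."""
    totals = {name: 0 for name, _, _ in STAGES}
    for day, n in daily_counts.items():
        name = stage_of(day)
        if name is not None:
            totals[name] += n
    grand_total = sum(totals.values())
    return {name: t / grand_total for name, t in totals.items()}

# Hypothetical daily post counts (not data from the study).
toy_counts = {
    date(2019, 6, 12): 10,
    date(2019, 7, 1): 10,
    date(2019, 8, 14): 60,
    date(2019, 10, 1): 20,
}
shares = stage_shares(toy_counts)
print(shares)
```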
To extract the topics of each stage, the topic-discovery method based on the coword network was applied to the three stages, and the Gephi package was used to detect the topic communities. Table 1 shows the results. The topics during the incubation period were related to the demonstration issues arising from the ELAB and the reasons for the ELAB. During this period, public opinion hotness was generally stable and accompanied by two small peaks mainly related to demonstrations by people in Hong Kong. In terms of topic content, netizens were relatively sensible, where "sensible" means that people behaved peacefully. According to Wen Wei Po, there were 12 demonstrations during this period, so there was little reaction from mainland netizens. As Figure 5 shows, the post count during this period was also low (9.2% of the total post count). Although there were topics insulting radicals who participated in the demonstrations, overall the discussions were mostly sensible. Tan et al. found that during the incubation period, Hong Kong netizens were in a sensible state, whereas during the outbreak period, their expressions gradually became radical and extreme [16]. Based on their research, we believe that the sensibility of the topic content stemmed from the sensibility of Hong Kong netizens in the early days of the Anti-ELAB Movement.
The topic of the outbreak period was centered on "demonstrations." The intensity and frequency of demonstrations increased, and violent acts occurred. Radical protesters clashed with the Hong Kong police and damaged national flags and emblems. The protests affected the emotions of mainland netizens. As shown in Figure 5, public opinion hotness increased rapidly during the outbreak period, when sensible demonstrations gradually turned into insensible ones. During this period, more and more celebrities began to speak out in support of the Hong Kong police, and positive topics began to form. Here, "positive" means beneficial to the Hong Kong government or the Chinese government. For example, "supporting the police" and "guarding the flag" are expressions of support for the Hong Kong government and the Chinese government. During the outbreak period, Wen Wei Po published 120 "police support" news items and 215 "love Hong Kong" news items, such as "Hong Kong cannot continue to be in chaos. Let's protect the rule of law and support the police." As netizens participated in topic discussions, topics such as "supporting the police," "guarding the flag," and "antiviolence" emerged.
Topics during the recession period also focused on "demonstrations," and the number of topics shrank compared to the outbreak period. The six topics during the recession period involved "harm of violence to society," "demonstrations," "Hong Kong police, aggressive youth," "education on Hong Kong independence," "love the motherland, love the national flag," and "freedom and democracy." The topic content thus reflected the event.
For instance, topic 0 reflected violent activities, topic 1 reflected the demonstration, topic 2 reflected clashes between police and youth, and topic 3 reflected celebrity opinions. In the event reflected in topic 4, there were Weibo topics such as "everyone is a flag guard" and "the five-starred red flag has 1.4 billion flag guards."

Topic Evolution Characteristics
Topic evolution detection based on the coword network was used to calculate the topic evolution table of the incubation period and the outbreak period. Based on the evolution table, an online mapping website was used to draw a topic evolution alluvial diagram, as shown in Figure 6. Each color block represents a topic, the height of the color block is the topic intensity, and each band is a topic evolution. In the process of evolution, there are correlations between topics at different stages, but the degree of correlation differs. For example, topic 0 in the incubation period concerns the influence of the Anti-ELAB Movement. It is slightly related to derivative topics (Shenzhen, senator) from the outbreak period and significantly related to event topics (guard, HK police). The reason for this is that any topic on social media has a rich sample; however, everyone views the problem from a different perspective, and any topic can therefore evolve into multiple topics.
Topic content, meanwhile, is related to the development stage of the event. During the incubation period, the event was in the early stage of development. Netizens did not know much about the event, and their discussions were limited. Few netizens were fully aware of the unrest in Hong Kong. Topics during the incubation period were diverse, focusing on the effect, substance, and problems of the Anti-ELAB Movement. Additionally, there were discussions regarding the reasons for the ELAB, Hong Kong rioters, and the global political environment. During the outbreak period, as demonstrations intensified, many secondary events occurred, including rioters attacking the Hong Kong police, looting and burning, and destroying the national flag. Thus, the topics at this stage involved reactions to the events. During the outbreak period, more netizens became aware of the unrest in Hong Kong, and so the intensity of topic 0 (HK secessionists, aggressive youth) was greater than that of topic 1 (HK, HK secessionists) during the incubation period. At the same time, the intensity of topic 0 (livelihood, feeling) and topic 2 (violence, campus, education) from the incubation period decreased. Topics during the outbreak period can be summarized as "love the motherland and support the police," accompanied by substantive discussions and event analysis. Topics during the recession period focused on three aspects: "love the motherland and support the police," event analysis, and the effects of demonstrations. During the recession period, topics related to "love the motherland and support the police" increased, as did topics on the effect of the Anti-ELAB Movement and on education. However, the intensity was weaker than in the incubation period. This is because the frequency and intensity of demonstrations during the recession period decreased; thus, discussion generated by the event decreased accordingly.
At the same time, the spread of topics related to "love the motherland and support the police" on social media involved more netizens in the topic discussions. In the whole process, the low-intensity topics of various periods tended to decrease or merge into new topics.

Spatiotemporal Topic Characteristics
An event-driven spatiotemporal topic hotspot discovery method was used to obtain the spatiotemporal topic hotspots during each period (Table 2). The number of spatial hotspots in each district during the whole process was counted, and the representative words of the districts were extracted from the corpus. Finally, a spatial hotspot distribution map of demonstration topics was created based on frequency and representative words (Figure 7). There were three dates when hotspots occurred during the incubation period (6.14, 7.1, 7.10). Due to lags in news reports, hotspot dates correspond to events that occurred in the previous 0-3 days. June 14 corresponds to the three demonstrations on June 12, which took place in the Central and Western District (i.e., Admiralty Commercial District, Tim Mei Avenue Central Government Complex in Admiralty, and Admiralty Legislative Council Complex). July 1 corresponds to the demonstration that occurred at the Admiralty Legislative Council Complex on July 1. July 10 corresponds to the demonstrations that occurred on July 10 at West Kowloon Station and Tsim Sha Tsui in Yau Tsim Mong District.
During the outbreak period, there were three hotspot dates (7.22, 7.28, 8.14). Regarding the topical spatial hotspots, the representative words of the locations where the hotspots occurred all reflected events or event locations. Here, an event is a demonstration, and a location is where it took place. For example, the Islands District's representative words were "aggressive youth," "airport," "gang fighting," and "Hong Kong Police," and the location "airport" corresponds to the Hong Kong International Airport. The words "aggressive youth," "gang fighting," and "Hong Kong Police" described the event in which Hong Kong police were attacked. The area most closely related to the topic of "demonstrations" was the Central and Western District. The place with the most demonstrations in the Central and Western District was Admiralty. Demonstrations in the Central and Western District all triggered topical discussions. This was because the Central and Western District is mainly occupied by commercial buildings, banks, and government agencies. It is Hong Kong's commercial, financial, and administrative center. From the perspective of time, the hotspot for demonstrations during the incubation period was the Central and Western District. Then, other areas became hotspots for demonstrations, indicating that the demonstrations spread out from the Central and Western District as the center.

Conclusion
This study focused on online public opinion regarding the Anti-ELAB Movement and used topic evolution detection based on the coword network and event-driven spatiotemporal topic hotspot discovery. In this way, this study analyzed the topic evolution characteristics and spatiotemporal topic characteristics of online public opinion. Based on the three-stage topic-content analysis, topic-evolution analysis, and spatiotemporal topic hotspot analysis, the following conclusions can be drawn:
1) In terms of public opinion hotness (i.e., general online public opinion enthusiasm), the large-scale online public opinion discussed in this paper had an incubation period, an outbreak period, and a recession period. In the life cycle of public opinion, public opinion hotness first remained stable, then increased sharply, and finally oscillated downward. The life cycle of large-scale online public opinion had a large time span.
2) In terms of topic content and evolution, the content and topic evolution of topic 2 were both related to the events (i.e., topic content reflected the events). Positive (i.e., patriotic and anti-secessionist) events stimulated positive topics. Therefore, positive events had a positive effect on the topic guidance of large-scale online public opinion.
3) The spatiotemporal topic hotspots mainly reflected the spatiotemporal hotspots of secondary events in large-scale online public opinion and the concerns of netizens. There were more spatiotemporal hotspots in the incubation period and the outbreak period than in the recession period. The spatial topic hotspots in the incubation period and the outbreak period were concentrated. The spatial topic hotspots showed a one-core proliferation model with higher concentration in the core and lower concentration in the periphery.
Socioeconomic data can be used to further analyze the distribution of spatiotemporal topic hotspots, predictions can be made based on the results, and riot- and protest-prevention measures can be taken accordingly. Finally, the locations of positive events can be selected based on the spatiotemporal hotspots.