Corpora-Based Comparative Analysis of Synonyms “Situation, Environment and Circumstance”

This study compares the frequency, collocation and semantic prosody of situation, environment and circumstance in the Corpus of Contemporary American English (COCA) and the Written English Corpus of Chinese Learners (WECCL). Vocabulary is the cornerstone of second language learning, and its development is one of the central topics in second language acquisition research. Large in number, subtle in semantic difference and often presented in minimal contexts, English synonyms remain a persistent challenge for Chinese learners of English. Traditional synonym differentiation usually relies on item-by-item analysis of lexical meanings and on introspective qualitative methods such as intuition and experience. In practice, however, such guesswork is far from satisfactory. With the rapid development of information technology, corpus linguistics offers a more reliable approach to the study of synonyms. Based on corpus data, this paper compares and analyzes the selected synonyms using AntConc, a Chi-square Calculator and BFSU Collocator. The findings indicate that: 1) in terms of frequency of use, Chinese students tend to overuse the synonyms compared with native speakers; 2) in terms of salient collocations, Chinese students pair the synonyms with semantically vague collocates and use few collocation types; 3) in terms of semantic prosody, Chinese students show comparatively conspicuous inaccuracy as well as misuse of semantic prosody. Preliminary cause analysis points to two main factors influencing Chinese students' mastery of synonyms, namely L1 negative transfer and the misleading effect of Chinese-English dictionaries. Based on this analysis, the paper puts forward some suggestions for teaching synonyms and compiling dictionaries.


Introduction
This study compares the frequency, collocation and semantic prosody of situation, environment and circumstance in the Corpus of Contemporary American English (COCA) and the Written English Corpus of Chinese Learners (WECCL). Mastery of the subtle differences between synonyms is a key measure of English proficiency, yet it remains a prominent problem for learners given the sheer number of synonyms in English. Traditional means of synonym differentiation, tedious and often inaccurate, rely on item-by-item analysis of lexical meaning or on introspective qualitative methods. A more modern approach has emerged with corpus linguistics, a discipline that has developed rapidly alongside information technology and massive data. For synonym analysis, it provides authentic linguistic contexts and allows more objective and more accurate observation than the somewhat subjective introspective method.
In this paper, the author examines three synonyms in two corpora, WECCL and COCA, and applies the necessary data analysis methods to the following question: What are the similarities and differences in the frequency of use, collocation and semantic prosody of "situation, environment, and circumstance" between Chinese learners of English and native speakers of English?

Synonym Studies Abroad
The rapid development of corpus linguistics has driven a surge of synonym studies. Corpora have proved a powerful tool for researching lexical usage, collocation and related phenomena. By supplying authentic and specific contexts of language use, they provide fairly objective language data, particularly for synonym differentiation, which was traditionally a matter of subjective guesswork.
Corpus linguistics research abroad has developed over some 50 years, and great achievements have been made in corpus-based vocabulary study. In the early 1960s N. Francis and H. Kucera began to design and build the BROWN corpus; scholars then started to conduct large-scale computer-based corpus research, and corpus linguistics gradually developed. Kennedy [1] held that words and grammar cannot be treated separately and that teaching and learning should feature an organically integrated "lexicogrammar", a view that corpus studies have since followed. Saeed [2] distinguished several sets of synonyms in terms of conceptual, connotative, emotional and stylistic meaning. He distinguished press and cupboard by dialect; police officer, cop and copper, and naïve, gullible and ingenuous, by connotation; and wife, spouse, old lady and Missus by register. Harward and Etienne [3] observed synonyms in different contexts. They discussed in detail pavement and sidewalk, which differ by dialect (British English and American English), and beauty and pulchritude, which differ in formality and contextual style.
Hoey [4] proposed ten hypotheses on lexical priming, three of which involve synonyms: 1) every word is primed to occur with a specific semantic group, which is its semantic association; 2) the differences between synonyms are mainly reflected in their different collocations with other words and in their different grammatical and semantic associations; 3) every word is primed to occur in a specific discourse relation, which is its textual collocation. Shahzadi, Asghar and Javed [5] used Sketch Engine (SkE) to analyze the collocations, concordances, word sketches and sketch differences of the synonyms arrive and reach in the British National Corpus (BNC), and discussed how to teach the different functions of synonyms effectively using naturally occurring corpus discourse. Their study found that more meanings are associated with reach than with arrive, indicating the wider occurrence and use of reach. SkE data also showed that reach is much more frequent in the BNC than arrive. This suggests that data analysis based on a corpus of naturally occurring language can serve as an effective strategy for distinguishing and teaching the synonyms arrive and reach.

Domestic Synonym Studies
Lu [6] compared the use of cause, lead to and result in/from in the native-speaker corpora FLOB and FROWN with their use in the learner corpus CLEC in terms of frequency, collocation and semantic prosody. It was found that learners had difficulties in distinguishing these features. Different groups of learners showed significant differences in their word-use characteristics and acquisition patterns of synonyms, as reflected in varied frequencies of synonym use and in semantic conflicts in that use. Sun [7] studied the synonym pair "affect" and "influence" and compared their semantic similarities and differences in the native English corpus FROWN and the learner corpus CLEC. The author focused on three parameters, semantic prosody norm, semantic prosody polarity and semantic prosody strength, and analyzed the pragmatic meaning of the two words in the two corpora.
The literature review suggests the following. First, scholars at home and abroad generally use two methods of synonym differentiation, namely the traditional method and the corpus-based method. Second, corpus-based research has generated abundant quantitative data. Third, synonyms are usually studied and compared through corpus retrieval at several levels, including frequency of use, register, collocation and semantic prosody.
A categorical search of the domestic literature further shows that corpus-based research on English synonyms covers a comprehensive range of parts of speech, although most of it focuses on synonymous verbs and adjectives. In addition, research on synonymous nouns usually addresses only the characteristics of semantic prosody and seldom investigates features of collocation. The corpus, as a new generation of empirical research tool, has therefore not yet been used to its full advantage. Moreover, the author considers that corpus-based research on nouns is not yet deep enough and calls for further exploration. From the perspective of corpus linguistics, this study therefore takes a group of synonymous nouns as examples and seeks significant differences in frequency, collocation and semantic prosody between corpora of Chinese learners and of native speakers.

Corpus Linguistics
Corpus linguistics emerged in the 1980s as an important breakthrough in research methodology and went through three developmental stages. It updated the framework of linguistic description and views of language [8] and liberated lexical research from card making and manual retrieval. The first stage featured the first corpus to be established, the Brown University Standard Corpus of Present-Day American English (BROWN), with nearly one million words of American English covering 15 written styles. The second stage was marked by the establishment of corpora in different parts of the world and by the joint creation of international corpora. In China, the 500,000-word Guangzhou Petroleum English Corpus (GPEC) was built during this stage. Worldwide, the best known corpus is the BNC, which reached a size of about 100 million words in the 1990s. The third stage is characterized by three features: large-scale corpora of multiple types, mature corpus processing, and the application of corpora in all fields of language study. With vast amounts of genuine linguistic data, corpus linguistics tries to reveal the complexity of natural language from a new perspective.

Dimensions of Synonym Differentiation in Corpus Linguistics
Kennedy [9] proposed four levels of corpus analysis: (1) Lexical level: to explore a word with a view to its frequency, context and adjacent collocations. (2) Syntactic level: to analyze quantitatively various vocabulary combinations and sentence patterns with reference to grammar and part of speech. (3) Discourse structure: to study the coherence and cohesion features of spoken and written language. (4) Disciplinary types: to explore the features of different types of language in order to discover emergent standards or sub-standards.
Frequency is an indispensable concept in the study of lexical collocation. Word frequency statistics, whether compiled manually or by computer, count and analyze the occurrences of words in order to derive lexical regularities. They are often applied to the study of vocabulary patterns and to vocabulary teaching.
Another essential concept is collocation, which has been studied for more than 50 years from different angles in different periods. Collocation is the co-occurrence of two or more adjacent words in a text [10]-1 and is usually described with three terms: node word, collocate and span. Collocation statistics aim to measure typicality, usually by computing a Z-score or T-score from the raw co-occurrence counts of the node word and its collocate. The higher the score, the more typical the collocation.
The third key concept, semantic prosody, is defined in various ways by Sinclair [10]-2, Louw [11], and Stubbs [12]. In a given context, lexical items manifest a strong semantic preference that stems from the pragmatic purpose. The mechanism of semantic prosody strongly restricts the choice of collocates [13]-1, and prosody can be roughly divided into three categories: positive, neutral and negative [14]. The study of semantic prosody can play an important role in the differentiation of synonyms, and when applied to vocabulary teaching it can effectively supplement traditional teaching methods.
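The typicality scores mentioned above can be illustrated with a minimal sketch. It uses one common textbook formulation of MI, T-score and Z-score; actual tools may differ in detail (for example in how the span enters the expected frequency), and all counts below are hypothetical values chosen for illustration, not figures from this study.

```python
import math

def expected_cooccurrence(node_freq, collocate_freq, corpus_size, span=1):
    """Co-occurrences expected by chance if node and collocate were independent."""
    return node_freq * collocate_freq * span / corpus_size

def mi_score(observed, expected):
    """Mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(observed / expected)

def t_score(observed, expected):
    """T-score: difference from chance, scaled by the square root of observed."""
    return (observed - expected) / math.sqrt(observed)

def z_score(observed, expected):
    """Z-score (simplified form): difference from chance, scaled by sqrt(expected)."""
    return (observed - expected) / math.sqrt(expected)

# Hypothetical counts: a node word occurring 500 times and a collocate
# occurring 200 times in a 1,000,000-word corpus, co-occurring 40 times.
expected = expected_cooccurrence(500, 200, 1_000_000, span=1)
mi = mi_score(40, expected)
t = t_score(40, expected)
z = z_score(40, expected)
print(f"expected={expected:.2f} MI={mi:.2f} T={t:.2f} Z={z:.2f}")
```

All three scores grow as the observed co-occurrence count rises above the chance expectation, which is the sense in which a higher score marks a more typical collocation.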

Corpus of Contemporary American English (COCA)
The Corpus of Contemporary American English (COCA), the largest corpus of American English, is freely available online. It contains more than one billion words (20 million words for each year from 1990 to 2019) from texts of eight genres: spoken, fiction, popular magazines, newspapers and academic texts, plus, since the update in March 2020, TV and movie subtitles, blogs and other web pages. The sub-corpora are roughly balanced, and users can search the entire corpus or any sub-corpus. A good reference resource for researchers, English teachers and students, it provides a window on the uses of American English and the changes it has undergone. Its simple interface is a great convenience to researchers.

Written English Corpus of Chinese Learners (WECCL)
The Written English Corpus of Chinese Learners 2.0 (WECCL 2.0), a substantial sub-corpus of the Spoken and Written English Corpus of Chinese Learners 2.0 (SWECCL 2.0), is based on WECCL 1.0, published in 2005 by Foreign Language Teaching and Research Press. Edited by Wen [15] and representing Chinese college students' performance in writing expository essays, WECCL has had a great influence on linguistics. It is a database that displays the features and essence of interlanguage and provides insights into second language acquisition and foreign language teaching.

Research Tools
The analysis tools used in this study are AntConc, a Chi-square Calculator and BFSU Collocator. AntConc is a retrieval tool whose basic functions include extracting node words together with their co-occurring context and counting frequency of use. The Chi-square (χ²) test calculator is used to judge whether differences are statistically significant. BFSU Collocator is a collocation analysis program that calculates collocation strength by MI (mutual information), MI3, Z-score, T-score, Log-log and Log-likelihood.
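The kind of retrieval described above, a node word shown with its surrounding context plus a frequency count, can be sketched as a simple keyword-in-context (KWIC) search. This is an illustrative sketch only, not AntConc's implementation; the function name `kwic`, the tokenization rule and the sample sentences are assumptions made for the example.

```python
import re

def kwic(text, node, width=4):
    """Return (left context, node, right context) for each occurrence of node."""
    # Naive tokenization: lowercase alphabetic strings, keeping ' and -.
    tokens = re.findall(r"[a-z'-]+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical sample text, not taken from WECCL or COCA.
sample = "We must protect the environment. A quiet environment helps students study."
hits = kwic(sample, "environment", width=2)
for left, node, right in hits:
    print(f"{left:>15} [{node}] {right}")
print("frequency:", len(hits))
```

The number of concordance lines gives the raw frequency of the node word, and the left-hand context is where the modifiers examined in the collocation analysis appear.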

Frequency
The sizes of WECCL and COCA are 1,248,476 and 1,001,610,938 words respectively. Because WECCL consists of written language, the ACADEMIC genre of COCA, with 120,988,348 words, is taken as its counterpart for the purposes of this study.
i. The Frequency Data of Synonyms in WECCL and COCA
The frequencies of use of situation, environment and circumstance are listed in Table 1. According to these data, environment is the word most frequently used by both Chinese students and native speakers. Next comes situation, with about 50% of the frequency of environment for Chinese learners and about 60% for native speakers. Circumstance has a frequency of about 5% of that of environment for Chinese learners and about 3% for native speakers. Both groups thus use environment most often, then situation, and circumstance least.
ii. The Normalized Frequency Data of Synonyms in WECCL and COCA
Because the two corpora differ in size, raw frequencies must be normalized before they can be compared. The normalized frequency [16] is the observed (raw) frequency of a search item divided by the total number of words in the corpus, scaled to a common base such as one thousand, ten thousand or one million words. This study uses frequency per million words, as shown in the following table. The normalized frequencies indicate that Chinese learners use situation, environment and circumstance very differently from native speakers. The normalized frequencies for native speakers are 162.38, 262.08 and 7.71, far lower than those for Chinese learners, 440.54, 873.87 and 47.26. In both corpora environment has the highest normalized frequency of the three. The learners' normalized frequency is about 2.7 times that of native speakers for situation, about 3.3 times for environment and about 6 times for circumstance. The reasons for this gap call for further study.
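The normalization step can be sketched in a few lines. The WECCL size and the per-million figures for situation and circumstance are taken from the text; the raw count of 550 is a hypothetical value used only to show the calculation, and the function name is an assumption.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Scale a raw frequency to occurrences per `base` words."""
    return raw_count / corpus_size * base

WECCL_SIZE = 1_248_476   # corpus size reported in the text
hypothetical_raw = 550   # assumed raw count, for illustration only

per_million = normalized_frequency(hypothetical_raw, WECCL_SIZE)
print(f"{per_million:.2f} occurrences per million words")

# Learner-to-native ratios computed from the per-million figures in the text.
ratio_situation = 440.54 / 162.38
ratio_circumstance = 47.26 / 7.71
print(f"situation: {ratio_situation:.1f}x, circumstance: {ratio_circumstance:.1f}x")
```

Dividing by corpus size before comparing is what makes a 1.2-million-word learner corpus comparable with a sub-corpus roughly a hundred times larger.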
iii. The Chi-Square Analysis Data of Synonyms in WECCL and COCA
To determine whether native speakers and English learners differ significantly in frequency of use, a Chi-square test was run, as shown in Figure 1. The chi-square values for situation, environment and circumstance are 578.7789, 1724.0915 and 238.1352, far higher than the critical value of 3.84 (df = 1, p = 0.05). Since the p-values are far below 0.05, the differences in the use of the synonyms between the two corpora are statistically significant at the 95% confidence level. Chinese students tend to overuse situation, environment and circumstance.
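A chi-square comparison of one word's frequency in two corpora can be sketched as a 2x2 contingency test (word vs. all other words, corpus A vs. corpus B). This is a minimal sketch without continuity correction and may not match the calculator used in the study; the corpus sizes come from the text, while the raw counts are hypothetical.

```python
def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square for a word's frequency in two corpora (df = 1)."""
    a, b = freq_a, size_a - freq_a   # corpus A: target word, all other words
    c, d = freq_b, size_b - freq_b   # corpus B: target word, all other words
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL_VALUE = 3.84            # chi-square critical value for df = 1, p = 0.05

WECCL_SIZE = 1_248_476           # from the text
COCA_ACADEMIC_SIZE = 120_988_348 # from the text
learner_hits = 550               # hypothetical raw count
native_hits = 19_646             # hypothetical raw count

statistic = chi_square_2x2(learner_hits, WECCL_SIZE, native_hits, COCA_ACADEMIC_SIZE)
significant = statistic > CRITICAL_VALUE
print(f"chi-square = {statistic:.2f}, significant at 0.05: {significant}")
```

A statistic above 3.84 means the difference in relative frequency between the two corpora is unlikely to be due to chance at the 0.05 level.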

Collocation
Analysis of part-of-speech data is crucial for the study of collocation. Based on COCA and WECCL, this study compares the percentages of the different parts of speech among the collocates of situation, environment and circumstance, as shown in the table below.
i. The Collocation Data of Synonyms in COCA
The focus of the study is the collocation of each synonym with adjectives. Collocation strength between the node word and the collocate is measured by the MI (mutual information) score; when the score is greater than 3, the collocation is considered salient. In COCA's query interface, the span is set to [1L, 0R], that is, one word to the left of the node and none to the right, and the minimum MI score is set to 3. The numbers of salient collocates (MI>3) retrieved from COCA for situation, environment and circumstance are 51, 86 and 29, respectively. Since 29 is the largest possible number of collocates for circumstance, 29 salient collocates (MI>3) of each synonym are listed for comparison in Table 4.
For situation, the nouns are security and employment, accounting for about 6.9% of the total salient collocates (see Table 3), and the determiner is this, accounting for about 3.4% (ibid.); the remaining collocates are adjectives. Therefore, ADJ. + situation has the highest proportion.
For environment, the nouns are ERP and classroom, accounting for about 6.9% of the total salient collocates (see Table 3), and the verb is built, about 3.4% (ibid.); the remaining collocates are adjectives. Therefore, ADJ. + environment accounts for the highest proportion.
ii. The Collocation Data of Synonyms in WECCL
For situation, the determiner is this, accounting for about 3.4% of the total salient collocates (see Table 3); the article is the, about 3.4% (ibid.); and the verb is embarrass, about 3.4%; the remaining collocates are adjectives. Therefore, ADJ. + situation has the highest proportion.
For environment, the pronoun is our, about 3.4% (ibid.); the article is the, about 3.4% (ibid.); the verb is protect, about 3.4% (ibid.); and the preposition is of, about 3.4% (ibid.); the remaining collocates are adjectives. Therefore, ADJ. + environment accounts for the highest proportion.
For circumstance, the nouns are academy, ease, restaurants, peace and language, accounting for about 17.2% of the total salient collocates (see Table 3); the determiners are this and such, about 6.9% (ibid.); and the article is the, about 3.4% (ibid.); the remaining collocates are adjectives. Therefore, ADJ. + circumstance accounts for the highest proportion.
Corpus-based analysis indicates that different synonyms have their own preferred collocations. The sample synonyms situation, environment and circumstance differ greatly in their salient collocates in the two corpora. Compared with native speakers of English, Chinese learners use fewer combinations of the ADJ. + N type but draw their collocates from more parts of speech.
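The retrieval settings described above (a span of one word to the left, an MI threshold of 3 and, for WECCL, a minimum co-occurrence count of 2) can be sketched as follows. This is a simplified illustration, not the procedure of the COCA interface or BFSU Collocator; the function name and the toy token list are assumptions.

```python
import math
from collections import Counter

def left_collocates(tokens, node, min_mi=3.0, min_freq=2):
    """Find words immediately left of `node` whose MI score reaches `min_mi`."""
    n = len(tokens)
    unigram = Counter(tokens)
    # Span [1L, 0R]: count only the word directly before each node occurrence.
    pairs = Counter(tokens[i - 1] for i in range(1, n) if tokens[i] == node)
    salient = {}
    for word, observed in pairs.items():
        if observed < min_freq:
            continue
        expected = unigram[node] * unigram[word] / n
        mi = math.log2(observed / expected)
        if mi >= min_mi:
            salient[word] = round(mi, 2)
    return salient

# Hypothetical toy corpus: "filler" stands in for unrelated running text.
toy_tokens = ("a quiet environment " * 3 + "the environment " + "filler " * 90).split()
result = left_collocates(toy_tokens, "environment")
print(result)
```

In the toy data, quiet precedes the node three times and passes both thresholds, while the, which precedes it only once, is filtered out by the minimum count.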

iii. Comparative Analysis of Synonyms Collocation in COCA and WECCL
Chinese learners tend to use "serious" to describe a situation, whereas native speakers tend to use dire situation rather than serious situation. Chinese learners also differ significantly from native speakers in the words they choose to modify environment. For example, the collocates "noisy" and "quiet" are salient in both corpora, but Chinese students prefer the expression "quiet environment", while native speakers tend to use "noisy environment". Chinese learners use both ADJ. + circumstance and N. + circumstance, the latter possibly as a result of L1 negative transfer, while native speakers mainly use ADJ. + circumstance.
To sum up, in their salient collocations with the sample synonyms Chinese students mostly choose general adjectives such as good, bad and serious, which seem simple, limited and to some extent monotonous. Native speakers of English, on the other hand, have a wide range of salient collocations with semantically specific content words such as dire, fortunate and precarious, which express semantic intentions vividly. The lack of semantic specificity and vividness in Chinese learners' use of English is most likely due to their limited mastery of vocabulary and idiomatic expressions.

Semantic Prosody
i. The Semantic Prosody Data of Synonyms in COCA
According to Sinclair [13]-2, semantic prosody is the tendency of semantic preference driven by pragmatic purpose. This tendency in turn strongly restricts the choice of collocates, so that a node word attracts homogeneous items from a limited number of semantic groups. In COCA the numbers of salient collocates (MI>3) of situation, environment and circumstance are 51, 86 and 29 respectively, and the numbers of adjectives are 26, 26 and 28 respectively. The collocates can be divided into groups of positive, negative and neutral semantic prosody.
The above table shows the semantic prosody characteristics of the prominent adjective collocates of situation in COCA. Positive semantic prosody, including win-win and ideal, accounts for 5.4%. Negative semantic prosody, including no-win, stressful, etc., accounts for 20%. Neutral semantic prosody, including real-life, hypothetical, etc., accounts for 74.6%. To sum up, situation shows a neutral semantic prosody in COCA. At the same time, the data carry an important message: situation has a strong tendency to combine with neutral or negative words, while only 5.4% of its collocates are positive words with agreeable implications.
The above table shows that the adjectives used by native speakers to modify environment fall into positive, negative and neutral categories. The positive category, including healthful, welcoming and so on, accounts for 53.0%. Negative semantic prosody, including noisy and harsh, accounts for 5.1%. Neutral semantic prosody, including indoor, marine, etc., accounts for 41.9%. To sum up, environment shows a strong tendency to combine with words of positive semantic prosody, sometimes with neutral words, but very rarely with words of negative implication.
As is seen, native speakers use a variety of adjectives to modify circumstance, common collocates including special and particular. The neutral category has the highest percentage, 89.8%. The negative category accounts for only 5.4% and includes blend, unfortunate, etc. The positive category has the lowest percentage, 4.8%, and includes words like mitigating, fortunate, etc. In sum, circumstance shows a strong tendency to combine with words of neutral semantic prosody.
ii. The Semantic Prosody Data of Synonyms in WECCL
In WECCL 2.0, the numbers of collocates of situation, environment and circumstance (MI≥3 and N≥2) are 91, 109 and 31, and the numbers of adjectives are 26, 25 and 21 respectively. As before, 29 collocates (MI≥3 and N≥2) of each synonym are listed. These words can be divided into groups of positive, negative and neutral semantic prosody. If one category accounts for more than 50% of the collocates, the tendency is considered obvious.
In the above table, neutral semantic prosody accounts for 70.99% of the adjectives paired with situation, mainly words like current, present, etc. Adjectives with negative semantic prosody, such as harsh and difficult, account for 28.25%. Only 0.76% of the words used to describe situation have positive implications, the most frequent one being "educated".
Chinese students use a variety of adjectives to describe environment. The highest proportion involves positive semantic prosody (73.24%), including words such as protecting, good, etc. Words with neutral semantic prosody, such as quiet, new, etc., account for 25.35%, and words with negative semantic prosody, e.g., vulnerable and alluring, for 1.41%. Therefore "environment" collocations in WECCL favor positive semantic prosody.
As is seen, the adjectives modifying circumstance in Chinese students' language use cluster around words with neutral semantic prosody, which account for 59.26%. Good, warm and other words with positive semantics form the second group, with 29.63%. The lowest percentage is that of negative semantic prosody, including adverse, exhausted, etc., at 11.11%. Therefore circumstance shows a neutral semantic prosody in WECCL.
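The classification step described above can be sketched as follows: each collocate is assigned a polarity label, the shares of the three categories are computed, and a tendency is called obvious when one category exceeds 50%. The polarity labels and counts below are hypothetical, and weighting each collocate by its token count is an assumption, since the text does not say whether types or tokens were counted.

```python
from collections import Counter

CATEGORIES = ("positive", "neutral", "negative")

def prosody_profile(collocate_counts, polarity, threshold=50.0):
    """Return percentage shares per category and the obvious tendency, if any."""
    totals = Counter()
    for word, count in collocate_counts.items():
        totals[polarity[word]] += count
    grand_total = sum(totals.values())
    shares = {label: 100 * totals[label] / grand_total for label in CATEGORIES}
    dominant = max(shares, key=shares.get)
    # The tendency counts as obvious only above the 50% threshold.
    tendency = dominant if shares[dominant] > threshold else None
    return shares, tendency

# Hypothetical manual annotation and hypothetical counts.
polarity = {"good": "positive", "current": "neutral",
            "present": "neutral", "difficult": "negative"}
counts = {"good": 1, "current": 5, "present": 2, "difficult": 2}

shares, tendency = prosody_profile(counts, polarity)
print(shares, tendency)
```

The polarity labels themselves remain a manual judgment; the code only automates the tallying and the threshold check.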
iii. Comparative Analysis of Synonyms Semantic Prosody in COCA and WECCL
The semantic prosody features of the sample synonyms show both differences and similarities across the two corpora. First, the semantic prosody of situation is predominantly neutral for both native English speakers and Chinese students. However, the proportions differ between the two groups. Native English speakers are 7 times more likely than Chinese students to use situation with positive semantic features. Chinese students may therefore be making too little use of the positive semantic prosody of situation. Secondly, both native speakers and Chinese students use the word environment positively. Again, however, the proportions differ. For native English speakers the proportions of positive and neutral semantic prosody are fairly close, while for Chinese students the proportion of positive semantic prosody is about 3 times that of neutral semantic prosody. Chinese students might therefore be overusing the positive semantic prosody of environment.
Finally, as far as circumstance is concerned, both native speakers and Chinese students favor the neutral category, but once more the proportions differ. Chinese students are 6 times more likely than native English speakers to use circumstance with positive semantic features. Thus, Chinese students might be overusing the positive semantic prosody of circumstance.
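The "7 times" and "6 times" comparisons follow directly from the positive-prosody percentages reported for the two corpora. The short check below uses only figures stated in the text; the variable names are illustrative.

```python
# Percentage of positive semantic prosody reported in the text.
positive_share = {
    "situation":    {"COCA": 5.4, "WECCL": 0.76},
    "circumstance": {"COCA": 4.8, "WECCL": 29.63},
}

# Native speakers relative to learners for situation.
situation_ratio = positive_share["situation"]["COCA"] / positive_share["situation"]["WECCL"]
# Learners relative to native speakers for circumstance.
circumstance_ratio = positive_share["circumstance"]["WECCL"] / positive_share["circumstance"]["COCA"]

print(f"situation: {situation_ratio:.1f}x, circumstance: {circumstance_ratio:.1f}x")
```

Both ratios compare shares of collocates rather than absolute frequencies, so they describe how each group distributes its choices rather than how often it uses the word.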

Analysis of Research Findings
It is insightful to analyze reasons for the significant differences in the three aspects studied with the sample synonyms: frequency, salient collocation and semantic prosody. As illustrated above, both similarities and significant differences are suggested with COCA and WECCL, and this part attempts to seek causes underlying the differences.
First, L1 negative transfer has an undeniable impact. In terms of the cognitive basis for L2 acquisition, existing L1 knowledge inevitably influences L2 development [17] at the levels of vocabulary, syntax and discourse. The second language acquisition process of Chinese students is unavoidably affected by Chinese expressions, cultural characteristics and other factors. Moreover, Chinese students are used to thinking in Chinese, and English is rarely activated when ideas are first formed. When Chinese students are uncertain how to word a specific concept in English, they tend to turn to Chinese for help, which results in inaccurate use of English synonyms as well as other transferred expressions.
Second, the traditional method of teaching English vocabulary is also one of the factors behind the significant differences. When teaching vocabulary, English teachers usually stress grammar and key sentences, supplemented by a few typical examples. Because other sentences and contexts are neglected, however, students may become confused about synonyms. Specifically, students do not notice or grasp the particular ways in which words are used or the semantic prosody they carry in specific contexts.
Third, the somewhat misleading effect of English-Chinese dictionaries may also cause significant differences. Guo [18] investigated the dictionary use of 485 college students in 3 Chinese universities and found that 44.9% used English-Chinese dictionaries for translation and writing tasks. As one of the essential reference books for English vocabulary learning, the English-Chinese dictionary thus plays an indispensable role. However, different English-Chinese dictionaries list different English synonyms under the same Chinese heading, and some of these listings do not help students to distinguish and analyze synonyms. Furthermore, most English-Chinese dictionaries lack information on synonym differentiation and do not explicitly list differences in collocation or semantic prosody. It is therefore very difficult for Chinese students to choose the right word in a specific context.

Implications
"Language is like a dress. We vary our dresses to suit the occasion. We don't appear at a friend's silver wedding anniversary in gardening clothes, nor do we go punting on the river in a dinner-jacket" [19]. Judging from Potter's words, it is important to choose the right words in different contexts.
Whether written or oral, the proper selection of synonyms is the key to expressing our intentions.
This research has some implications for teaching. Corpus data provide easily accessible information about real language use [20]. On the one hand, when English teachers discover unconventional interlanguage collocations in students' oral or written tasks, they should think about the reasons for such misuses. Attention should therefore be paid in vocabulary teaching to explaining frequency, typical collocation and semantic prosody. In this way, Chinese students can come as close as possible to idiomatic expression, reducing misuse and effectively lessening the impact of L1 negative transfer. On the other hand, Chinese college students should learn to use corpora independently. Nowadays most college students have computers and convenient Internet access, and corpora offer them a new platform for learning. Many corpora can be consulted, e.g., the BNC (British National Corpus), COCA and LOB (Lancaster-Oslo/Bergen Corpus). By searching corpora of native-speaker language, learners can observe the characteristics of native speakers' vocabulary use, such as typical collocations and semantic prosody.
In terms of dictionary compilation, the large number of authentic examples in corpora provides good material for compilers. Corpus-based lexical information can show how words are used in different contexts and thus makes it possible to differentiate synonyms. The collocation and semantic prosody information obtained from the analysis of corpus data can help improve dictionaries by adding detailed analysis of synonym differences to word definitions and mere lists of synonyms. Corpus-based research therefore provides very useful information for the future development of Chinese-English lexicography.

Conclusion
This paper presents a comparative study of the synonyms "situation, environment, circumstance" in WECCL and COCA from three aspects: frequency, collocation and semantic prosody. The research findings have several implications. First, in terms of frequency, Chinese students tend to overuse these synonyms compared with native speakers. For instance, where a native speaker might use "linguistic context", a Chinese learner might prefer "situation" instead. Secondly, in terms of salient collocations, Chinese students tend to pair the synonyms with semantically vague collocates and use few collocation types. Thirdly, in terms of semantic prosody, Chinese students show conspicuous inaccuracy as well as misuse of semantic prosody. Underlying these differences are two possible factors influencing Chinese students' mastery of synonyms, namely L1 negative transfer and the misleading effect of Chinese-English dictionaries. Finally, the data-based analysis supports some suggestions for synonym teaching and dictionary compiling.