A New Stylometry Method Based on Numeral Statistics

A new method of statistical text analysis is proposed. We consider the frequency distribution of the first significant digits of numerals in coherent English-language texts by individual authors. Benford's law is found to hold approximately for these frequencies, with a marked predominance of the digit 1. Deviations from Benford's law are statistically significant authorial features that, under certain conditions, make it possible to address questions of authorship and to distinguish between texts by different authors. Toward the end of the row {1, 2, ..., 8, 9}, the digit distribution is subject to strong fluctuations and is therefore unrepresentative for our purpose. The proposed approach and the conclusions are supported by computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson, and others. The results are confirmed by the nonparametric rank-based Mann-Whitney and Kruskal-Wallis tests as well as by the parametric Pearson's chi-squared test.


Introduction
Recently, the scope of practical applications of Benford's law [1] has expanded significantly. Known for over a hundred years, Benford's law describes the probability with which a given first significant digit occurs in various real-life data sets. Contrary to the common assumption that all first significant digits are equally frequent, the digit 1 occurs most often in many data sets. According to Benford's law, in the decimal system the probability that the digit $d$ occurs as the first significant digit is
$$P(d) = \lg\left(1 + \frac{1}{d}\right), \qquad (1)$$
so the probability of $d = 1$ should be $\lg 2 \approx 0.30$, the probability of $d = 2$ is about 0.18, etc. An exhaustive explanation of Benford's law, covering all cases in which it manifests itself, has not yet been proposed, although some conditions favouring its emergence have been stated. A classic experiment by Benford that shows good agreement with (1), namely an analysis of the numerals occurring in the articles of a randomly selected magazine issue, is naturally explained by Hill's theorem [2]: if one repeatedly chooses a probability distribution at random and then draws a number at random from that distribution, the resulting data set will obey Benford's law. Note that Benford himself analyzed only numerals expressed in figures.
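For reference, the probabilities prescribed by Benford's law (1) for all nine digits can be tabulated directly. The short Python sketch below is an illustration only and is not part of the original analysis; the function name `benford_probability` is introduced here for convenience.

```python
import math

def benford_probability(d: int) -> float:
    """Benford probability that d (1..9) is the first significant digit."""
    if not 1 <= d <= 9:
        raise ValueError("d must be an integer between 1 and 9")
    return math.log10(1 + 1 / d)

# Table of probabilities for d = 1, ..., 9; the values sum to 1
# because the product of (1 + 1/d) over d = 1..9 equals 10.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.3f}")
```

The values decrease monotonically from about 0.30 for the digit 1 to less than 0.05 for the digit 9, which is the baseline against which the observed frequencies are compared.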
Incomplete understanding of the law [3] does not preclude the successful use of Benford's law for detecting fraud in accounting and auditing data [4] and in elections [5]; the proposed applications extend from physics and astronomy [6,7] through seismology [8] to steganography [9] and scientometrics [10].
Zenkov [11] has shown that counting the frequencies of the first significant digits of numerals is effective for text attribution. It was found that the frequency distribution resembles Benford's law (1) not only for random combinations of heterogeneous texts, but also for coherent (Russian-language) texts, to which the conditions of the above theorem do not apply. However, the share of the digit 1 considerably exceeds 30 per cent, if only because the word "one", formally a numeral, can in fact play the role of an indefinite article.
In contrast to the traditional way of applying Benford's law, which treats deviations from the law as an indication of possible "falsification" (broadly defined), Zenkov emphasized the comparison of these deviations for texts by different authors. He showed that the deviations are statistically robust authorial features that make it possible to distinguish between texts by different authors under certain conditions, the most important of which is a sufficiently long text.
Building on these ideas, we present here new results on the distribution of the first significant digits of numerals in coherent English-language texts.
The study is empirical and experimental in nature. We do not attempt a theoretical explanation of the results (if one is possible at all); this, however, does not diminish the practical usefulness of the proposed methodology for problems of stylometry.
For all texts subjected to computer-aided statistical analysis (English-language fiction), we studied the frequency of occurrence of the first significant digits of numerals, taking into account both cardinal and ordinal numerals, whether expressed in figures or (considerably more often) in words. In the latter case, the first step was to rewrite every numeral in figures (e.g., 'one thousand, seven hundred and eighty-ninth' was replaced by '1789') and then to take into account only its first significant digit (here, 1). To isolate the author's own use of numerals, we first removed from the text all idiomatic expressions and set phrases that merely happen to contain numerals ('one hand washes the other', 'five-o'clock'), as well as itemizations such as 1), 2), 3), etc.
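The counting procedure described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the program used in the study: all names (`word_value`, `numerals_in`, `first_digit_counts`) are hypothetical, only number words up to "million" are recognized, and the manual removal of idioms, set phrases, and itemizations is not reproduced. Ambiguous words such as "one" (pronoun) or "second" (unit of time) are counted as numerals.

```python
import re
from collections import Counter

# Lookup tables for English number words (helper data for this sketch).
SMALL = {w: i for i, w in enumerate(
    "one two three four five six seven eight nine ten eleven twelve thirteen "
    "fourteen fifteen sixteen seventeen eighteen nineteen".split(), start=1)}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}
IRREGULAR = {"first": "one", "second": "two", "third": "three", "fifth": "five",
             "eighth": "eight", "ninth": "nine", "twelfth": "twelve"}

def word_value(word):
    """Return the value of a cardinal or ordinal number word, else None."""
    if word in IRREGULAR:
        word = IRREGULAR[word]
    elif word.endswith("ieth"):      # twentieth -> twenty
        word = word[:-4] + "y"
    elif word.endswith("th"):        # fourth -> four, hundredth -> hundred
        word = word[:-2]
    return SMALL.get(word) or TENS.get(word) or SCALES.get(word)

def numerals_in(text):
    """Yield every numeral found in the text, rewritten in figures."""
    # Drop the "and" inside numerals such as "seven hundred and eighty".
    text = re.sub(r"\b(hundred|thousand|million)\s+and\s+", r"\1 ", text.lower())
    total = current = 0  # value of the verbal numeral being assembled
    # The trailing empty token flushes a numeral that ends the text.
    for token in re.findall(r"\d+(?:[.,]\d+)*|[a-z]+", text) + [""]:
        value = word_value(token)
        if value is None:
            fits = False
        elif value < 10:     # units may follow tens, hundreds, or scales
            fits = current % 10 == 0 and current % 100 != 10
        elif value < 100:    # teens and tens may follow hundreds or scales
            fits = current % 100 == 0
        elif value == 100:   # "hundred" multiplies a preceding 1..99
            fits = 0 < current < 100
        else:                # "thousand" and "million" multiply what precedes
            fits = current > 0
        if total + current > 0 and not fits:
            yield str(total + current)   # the numeral in progress has ended
            total = current = 0
        if value is None:
            if token[:1].isdigit():      # numeral already written in figures
                yield token
        elif value < 100:
            current += value
        elif value == 100:
            current = max(current, 1) * 100
        else:
            total += max(current, 1) * value
            current = 0

def first_digit_counts(text):
    """Count the first significant digits (1..9) of all numerals in the text."""
    counts = Counter()
    for numeral in numerals_in(text):
        digit = next((c for c in numeral if c in "123456789"), None)
        if digit is not None:
            counts[int(digit)] += 1
    return counts

# The example from the text: the verbal ordinal is rewritten as '1789'.
print(list(numerals_in("one thousand, seven hundred and eighty-ninth")))  # -> ['1789']
```

Dividing each count by the total number of numerals gives the relative frequencies that are compared with Benford's law. Because punctuation is discarded, a sequence such as "twenty, three" would be merged into 23 by this sketch; a production tool would need a more careful treatment of such cases.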
The texts analyzed were taken mainly from the Project Gutenberg website, http://www.gutenberg.org

Distribution of First Significant Digits of Numerals in Compound Texts
The conditions of Hill's theorem are best satisfied by compound texts that contain pieces by different authors. In this case, the authorial peculiarities are averaged out, and we obtain a Benford-like dependence of frequency on digit, but with a steeper drop and with the digit 1 occurring much more often than prescribed by Benford's law (1).
Figure 1 shows the results of the analysis of eight English-language compound fiction texts [12-19]. For each collection of stories, the frequency decreases monotonically; the results for different collections are similar on the whole, and the variations may be due to peculiarities of the texts in each collection (for example, genre and time of creation).

Distribution of First Significant Digits of Numerals in Coherent Texts
Texts written by a single author usually show persistent peculiarities in the statistics of the first significant digits of numerals, and the distribution of these digits is a stable characteristic of the author.
As an example, we show here the distributions of the first significant digits of numerals in texts by W. M. Thackeray, M. Twain, and R. L. Stevenson (Figures 2-4). The differences in these statistics between authors are not always striking, as in the case of the novels by the Brontë sisters, which is not surprising in view of their common family and educational background. The frequency of the digit 1 can reach twice the value given by Benford's law (Figures 1-5). It is this digit, and to a lesser degree the digits 2 and 3, that determines the authorial signature of a text in our approach. The occurrence of the subsequent digits is subject to strong fluctuations, which precludes obtaining useful information from their distribution. In Figures 2-5, the frequency of the digit 1 is usually about 0.5; as will be shown later, this frequency can differ strongly from that value.
The frequency of the digit 1 is, so to speak, a 'fingerprint' that makes it possible to distinguish between authors if this frequency differs strongly between their texts. How large must the difference be to be regarded as significant? We answer this question at the end of the article.

Jane Austen and Her Imitators
The domestic novels of manners by Jane Austen (1775-1817) inspired numerous sequels and prequels. Related subject matter, and even the intention to write in the same manner, did not prevent the imitators from differing starkly in their use of numerals (Figure 6). Thus, Benfordian analysis can be useful in the study of text authorship.

Authorship of the 15th Book of Oz
Lyman Frank Baum, a prolific writer whose "Wonderful Wizard of Oz" was a great success, wrote 13 sequels to this book before his death. The series was so popular that the publishers decided to continue it. The 15th book, "The Royal Book of Oz", published after Baum's death, was written "by L. Frank Baum,…, Enlarged and Edited by Ruth Plumly Thompson", as noted on the title page of the first edition (1921). Subsequently, the view spread (supported by linguistic and statistical arguments) that Thompson did not base the story on any notes left by Baum, so that "The Royal Book of Oz" was entirely her own work [20]. This opinion is now generally accepted.
Although this particular philological question has already been solved, we will show the results of applying our methodology.
Below are the results of the statistical study of Baum's books as well as of sequels by Thompson and by other authors (Figures 7-9). Note the dramatic difference in the occurrence of the first significant digit 1 between Baum's texts, on the one hand, and the texts by Thompson (in particular, "The Royal Book of Oz"), on the other. In view of the length of the texts analyzed, this striking difference can hardly be explained by random fluctuations (unlike the subsequent significant digits, which behave differently even in books by the same author); it points to Thompson as the author. Besides Thompson, many other writers created sequels to "Wonderful Wizard of Oz". Again, the common theme did not lead to similar distributions (Figure 9). We are inclined to regard this difference as a characteristic of the author's style and to associate it with psychological peculiarities that influence an author's texts regardless of his will and intention. Thus, the statistical method based on counting the first significant digits of numerals can help to answer questions of text authorship.

Testing of Methodology: Harper Lee and Truman Capote
Harper Lee's "To Kill a Mockingbird", published in 1960, is considered one of the greatest novels of American literature. In 2015, shortly before her death, another novel of hers, "Go Set a Watchman", was published. Initially promoted by its publisher as a sequel, it is now widely accepted to be a first draft of her famous novel.
Truman Capote was a lifelong friend of Harper Lee, and one of the characters in "To Kill a Mockingbird" was based on him. In contrast to Lee, who is in effect the author of a single book, he was much more prolific, and many of his works are recognized literary classics. Speculation eventually arose that Capote had ghostwritten Lee's book.
Testing this hypothesis is an interesting application of the idea about the relation of text authorship to its statistical characteristics.
We counted the frequencies of the first significant digits of numerals in novels by Harper Lee and Truman Capote (Figure 10). The results of the analysis are unexpected: the properties of "To Kill a Mockingbird" are far from those of Capote's texts, whereas the first draft, "Go Set a Watchman", is close to them. It seems that Capote may have helped Harper Lee in writing the initial text; having gained experience, she seems to have written her famous novel by herself. We believe that our methodology can be a useful addition to traditional stylometric practices, which take into account sentence length, word length, the occurrence of certain words and parts of speech, etc. [21,22,23].

Conclusion
Benford's law holds approximately for coherent texts. Deviations from Benford's law are statistically significant authorial features that make it possible, under certain conditions (the most important of which is sufficient text length), to distinguish between texts by different authors.
For the significant digits 1, 2, and 3, the actual frequency of occurrence is usually higher than the probability given by Benford's law; for the subsequent digits the situation is reversed. Toward the end of the row {1, 2, ..., 8, 9}, the digit distribution is characterized by strong fluctuations and is therefore unrepresentative for our purpose.
Of course, the comparison of distributions cannot rest merely on subjective visual impressions of their similarity or difference. To quantify the comparison, we applied the nonparametric rank-based Mann-Whitney U test and Kruskal-Wallis test as well as the parametric Pearson's chi-squared test. The null hypothesis, which asserts the absence of significant differences between the distributions considered, was rejected or retained in exactly the cases described above, i.e., the visual assessment was correct.
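As an illustration of how such a comparison might be carried out, the following Python sketch computes Pearson's chi-squared statistic for the first-digit counts of two texts arranged as a 2 x 9 contingency table. The exact setup of the tests used in the study is not specified here, so this is only one plausible reading; the function name and the digit counts are hypothetical and serve purely as an example. The critical value 15.507 is the standard table value for 8 degrees of freedom at the 0.05 significance level.

```python
# Standard table value: chi-squared, 8 degrees of freedom, alpha = 0.05.
CRITICAL_0_05_DF8 = 15.507

def chi_squared_homogeneity(counts_a, counts_b):
    """Return Pearson's chi-squared statistic and the degrees of freedom
    for two lists of first-digit counts (digits 1..9 in order)."""
    if len(counts_a) != len(counts_b):
        raise ValueError("both texts must report the same digit categories")
    # Categories absent from both texts carry no information and are skipped.
    pairs = [(a, b) for a, b in zip(counts_a, counts_b) if a + b > 0]
    n_a = sum(a for a, _ in pairs)
    n_b = sum(b for _, b in pairs)
    if n_a == 0 or n_b == 0:
        raise ValueError("each text must contain at least one numeral")
    n = n_a + n_b
    statistic = 0.0
    for a, b in pairs:
        column = a + b
        for observed, row_total in ((a, n_a), (b, n_b)):
            # Expected count if both texts shared one digit distribution.
            expected = row_total * column / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(pairs) - 1

# Hypothetical counts of first digits 1..9 in two texts (invented data,
# not taken from the study): the share of the digit 1 is 0.50 versus 0.35.
text_a = [500, 180, 100, 60, 50, 40, 30, 20, 20]
text_b = [350, 200, 130, 90, 70, 60, 40, 30, 30]

statistic, dof = chi_squared_homogeneity(text_a, text_b)
differs = dof == 8 and statistic > CRITICAL_0_05_DF8
print(f"chi-squared = {statistic:.2f}, dof = {dof}, reject H0: {differs}")
```

For these invented counts the statistic far exceeds the critical value, so the null hypothesis of a common digit distribution would be rejected. The rank-based Mann-Whitney and Kruskal-Wallis tests are not covered by this sketch.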