Kognitor: Big Data Real-Time Reasoning and Probabilistic Programming

The amount of generated data has increased enormously since the explosion of the Internet. This data, usually collected in different formats and from multiple sources, is popularly termed big data. Big data contains uncertainty. To handle this uncertainty, probabilistic reasoning is used to develop probabilistic models that encode generic knowledge about different topics. These models are used in conjunction with an inference algorithm to support decision makers, especially in uncertain situations. Developing such models demands extensive knowledge of fields such as statistics, machine learning, and probability theory, and is therefore usually a difficult undertaking. Probabilistic programming was introduced to simplify the development of complex models. Furthermore, decision makers often need knowledge from historic data as well as current data to make cogent decisions; hence the need to unify the processing of historic and real-time data with low latency. The Lambda architecture was introduced for this purpose. This paper presents a framework called Kognitor that simplifies the design and development of complex models using probabilistic programming and the Lambda architecture. The framework is evaluated using a case study that highlights the potential of probabilistic programming to simplify model development and enable real-time reasoning on big data, thereby demonstrating its effectiveness; the results of this evaluation are reported. The Kognitor framework can be used to steer effective and easier implementation of complicated real-life situations as probabilistic models, which will benefit the big data processing domain and decision makers. Kognitor ensures cost-effectiveness by using contemporary big data tools and technology on commodity hardware, and the framework will also be beneficial in academia with respect to the use of probabilistic programming.


Introduction
Planning, analysis, and calculation generate facts that are usually not organized; a collection of these facts can be referred to as data. Huge amounts of data are produced and amassed, sometimes as a by-product of the activities and processes of entities and individuals [1]. Such data is termed big data. Big data has three popular qualities: variety, high velocity, and high volume, also known as the 3Vs [2]. The 3Vs form the foundational definition of big data.
Data is generated from multiple dissimilar sources with differing levels of uniformity [3]. This disparity in uniformity introduces noisy data that must be efficiently managed using novel techniques such as artificial intelligence and machine learning [4,5]. To address uncertainty in big data, the machine learning research community introduced probabilistic reasoning, which combines probability theory with deductive logic to enable formal reasoning, especially under varying conditions [6][7][8][9]. Probabilistic reasoning is also used to interpret complicated situations and thereby ease decision making [10][11][12].
The process of decision making is now often automated. Automated systems that aid decision making are generally called probabilistic reasoning systems [13]. According to [14], probabilistic reasoning systems consist of probabilistic models and inference algorithms. A probabilistic model encodes the generic knowledge and components of a domain using probabilistic formalisms such as Bayesian networks [15], hidden Markov models [16,17], and stochastic grammars [18][19][20][21]. Inference algorithms use probabilistic models together with evidence to produce probabilistic scores in response to queries; this procedure is known as probabilistic inference [14]. Designing a probabilistic model is a rigorous task that requires deep technical expertise in fields such as natural language, mathematics, and algorithms [8,11,22]. Moreover, many real-life situations are too expensive to model [14,[23][24][25][26]. In response to these issues, probabilistic programming emerged through efforts from the machine learning and programming language communities [24,27].
According to [14,24,26], probabilistic programming is a relatively recent idea. Nevertheless, [28] advocates the capabilities of probabilistic programming in artificial intelligence systems. This paper presents a framework that demonstrates the effectiveness of probabilistic programming in the development of big data processing systems that use complex probabilistic models. The paper also showcases the constructive integration of off-the-shelf, open-source, contemporary big data tools and technology on commodity hardware to realize the Lambda architecture.
The rest of the paper is organized as follows. Section 2 presents background knowledge on big data processing, the Lambda architecture, probabilistic reasoning, and probabilistic programming. Section 3 presents related work. Section 4 describes the Kognitor framework. Section 5 demonstrates an implementation of the framework using a case study. An evaluation of the framework is provided in Section 6. The paper ends with a conclusion in Section 7.

Background
Novel tools and techniques are required for the effective management and analysis of big data. Early technologies developed to analyze big data were mainly geared toward batch processing [29]. The majority of these batch processing tools used the MapReduce framework designed by Google [30]. A popular example of a batch processing big data tool implemented using MapReduce is Hadoop, which became widely accepted and extensively used in academia and industry [31][32][33][34][35]. Although MapReduce and Hadoop offered advantages in big data processing, they were unsuitable for processing high-speed big data that requires low latency [36][37][38][39][40]. Hence the need for big data stream processing.
Stream processing, sometimes referred to as real-time processing, deals with the velocity attribute of big data. Stream processing handles small pieces of data as they arrive, thereby enabling low latency [29,[41][42][43]. Examples of open-source stream processing systems are Apache Storm [44], Apache Spark [45], and SQLstream [46].
Decision makers require the processing of both static (high-volume) and real-time data for well-informed decision making [47,48]. Thus, stream processing or batch processing tools in isolation may not be the remedy in real-life situations; the need for solutions that support both batch and real-time big data processing is apparent [49][50][51]. Solutions that support this combination are called hybrid computation systems. Examples are the Kappa architecture [50], the Liquid architecture [52], and the Lambda architecture [49].

Lambda Architecture
The Lambda architecture was designed and proposed by [49] as a hybrid solution to big data issues. The architecture is made up of three layers: the batch layer, the serving layer, and the speed layer. Each layer is responsible for a unique problem in big data, and the functionalities of the layers build on each other. The batch layer is the central part of the Lambda architecture: raw, unprocessed data is permanently stored there and is periodically processed by a batch processing framework to yield batch views. To compensate for the high latency of batch processing, the speed layer employs an incremental model to achieve real-time processing; the results of this processing are known as real-time views. As soon as the same dataset has been processed in the batch layer, the corresponding real-time views are discarded. The serving layer uses both the batch views and the real-time views to provide low-latency responses to user queries [53][54][55][56][57][58].
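To make the interplay of the layers concrete, the following is a minimal sketch (not taken from the paper; the View type and the counting use case are illustrative assumptions) of how a serving layer can answer a query by merging the latest batch view with the real-time view covering data that arrived since the last batch run:

```scala
// A view is a precomputed aggregate; here, simple counts per key.
case class View(counts: Map[String, Long])

// Serving-layer query: combine the (complete but stale) batch view with
// the (small but fresh) real-time view to get an up-to-date answer.
def servingLayerQuery(batchView: View, realtimeView: View, key: String): Long =
  batchView.counts.getOrElse(key, 0L) + realtimeView.counts.getOrElse(key, 0L)
```

When the next batch run completes, the real-time view for the now-covered data is discarded and the cycle repeats.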

Probabilistic Reasoning
The process of decision making is sometimes straightforward, but in other cases it may require complicated procedures involving evidence from many sources [59]. This is clearly seen in uncertain circumstances, where the odds of uncertain events influence decision making [13,60]. The likelihood of an event is represented using probability. Thus, probabilistic reasoning simplifies decision making by combining underlying principles or knowledge with probability; it is the integration of what holds true about a circumstance with the laws of probability [14].
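As a simple worked illustration (the numbers here are hypothetical and not from the paper): suppose a team wins half of its matches, P(win) = 0.5, winning teams show good recent form 70% of the time, P(form | win) = 0.7, and losing teams show good form 40% of the time, P(form | ¬win) = 0.4. Upon observing good form, Bayes' rule updates the belief in a win to P(win | form) = (0.7 × 0.5) / (0.7 × 0.5 + 0.4 × 0.5) ≈ 0.64, combining domain knowledge (the conditional probabilities) with the laws of probability.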
Probabilistic reasoning systems are applications that automate the process of probabilistic reasoning. A probabilistic reasoning system is typically made up of an inference algorithm and a probabilistic model [14]. Probabilistic reasoning systems are useful for prediction, deduction, and improvement of general knowledge in a domain.

Probabilistic Programming
Due to the difficulty and complexity associated with modelling real-life scenarios as probabilistic models, the concept of probabilistic programming was introduced. In probabilistic programming, a model is expressed as a program in a programming language, and general-purpose inference algorithms provided by the language are applied to that program, relieving the modeller of implementing inference from scratch.
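As a minimal illustration, the following sketch uses Figaro (the probabilistic programming system adopted later in this paper); the model and its numbers are illustrative assumptions, not part of Kognitor:

```scala
import com.cra.figaro.language._
import com.cra.figaro.library.compound.If

// A complete two-line probabilistic model: the program itself is the model.
val rainy = Flip(0.2)                                  // prior: 20% chance of rain
val carriesUmbrella = If(rainy, Flip(0.9), Flip(0.1))  // behaviour given the weather
```

Inference over such a program (for example, the posterior probability of rain after observing an umbrella) is delegated to the library's built-in algorithms rather than hand-coded.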

Related Work
It is believed that probabilistic programming eases the process of designing complex probabilistic models. Research conducted by [67] to measure the adoption of probabilistic programming in big data processing identified a solution called InferSpark [68]. At the time of publication, InferSpark claimed to be the only solution that uses probabilistic programming to provide efficient statistical inference on big data. The authors of InferSpark also recognized the potential of probabilistic programming in the development of complex probabilistic models, while pointing out drawbacks of contemporary probabilistic programming systems.
An evaluation of InferSpark was also provided: according to [68], InferSpark outperformed MLlib and Infer.NET. However, InferSpark implemented only one inference algorithm, Variational Message Passing (VMP). The framework presented in this paper improves on this by using a probabilistic programming system called Figaro, which allows a probabilistic model to be designed using Bayesian networks, Markov models, or a combination of both. In addition, the Kognitor framework achieves low-latency computation by implementing the Lambda architecture.

System Architecture
A major motivation for the Kognitor framework is the need to support real-time decision making in uncertain situations. Kognitor uses probabilistic programming to enable easier development of complex real-life models, and the Lambda architecture to achieve real-time big data processing with low latency. The framework also achieves cost-effective data processing by combining contemporary off-the-shelf tools and technologies on commodity hardware. Kognitor is made up of three components: feeder, server, and storage. The three layers of the Lambda architecture are implemented across these three components.

Feeder Component
This component is responsible for data ingestion: it manages the flow of data into Kognitor and can aggregate data from multiple sources. The feeder component also cleans and filters out unnecessary and unrelated data before persisting the remainder in the storage component.

Storage Component
The storage component houses the data used by the Kognitor framework. It is made up of the master, pseudo-master, batch-view, and realtime-view databases, each responsible for a unique storage need of Kognitor.
The master database stores immutable, continuously expanding data; thus, it should support batch reads and random writes. The pseudo-master database holds data as it arrives from the feeder component.
The batch-view and realtime-view databases store the results of the data processing done on the master and pseudo-master databases, respectively.
In accordance with the Lambda architecture, the master database implements the batch layer, the batch-view database implements part of the serving layer, and the pseudo-master and realtime-view databases implement part of the speed layer.

Server Component
The central component of the Kognitor framework is the server component. All data processing is done by the server component. The server component is sub-divided into two modules: the batch module and the real-time module.
The batch module performs computation on the data stored in the master database at a set time interval. The real-time module, on the other hand, performs computation on data in the pseudo-master database as soon as it becomes available.
The batch module implements part of the batch layer of Lambda architecture, while the real-time module implements part of the speed layer.
It is important to note that the server component performs two types of computation. The first is the learning computation: Kognitor uses a learning algorithm to learn from the data stored in both the master and pseudo-master databases. Results of learning on the master database are stored in the batch-view database, while results of learning on the pseudo-master database are stored in the realtime-view database.
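The following is a hypothetical sketch of what one such learning computation might look like with Figaro's expectation maximization support (the parameter, the dummy records, and the choice of EMWithVE are illustrative assumptions; in Kognitor the records would be read from the master or pseudo-master database):

```scala
import com.cra.figaro.language._
import com.cra.figaro.library.atomic.continuous.Beta
import com.cra.figaro.algorithm.learning.EMWithVE

// Unknown model parameter, e.g. the probability that a team shows good form,
// with a uniform Beta(1,1) prior.
val goodFormProb = Beta(1, 1)

// One parameterized Flip per stored match record, observed to the recorded
// form indicator (dummy data standing in for a database read).
val records = Seq(true, true, false, true, false, true)
records.foreach { observedForm =>
  Flip(goodFormProb).observe(observedForm)
}

// Expectation maximization with variable elimination; the learned value
// would then be persisted to the batch-view or realtime-view database.
val em = EMWithVE(goodFormProb)
em.start()
println(goodFormProb.MAPValue)
em.kill()
```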
The second type is the reasoning computation: Kognitor uses an inference algorithm together with the data in the batch-view and realtime-view databases to answer queries. Figure 1 shows all the components of the Kognitor framework.

Case Study
To show the effectiveness of the Kognitor framework, an application called K4F was developed with it. K4F predicts the outcome of a football match. In this case study, two football teams were selected from the English Premier League (EPL).

Feeder Implementation
Akka [69,70] was used to implement the feeder component of Kognitor in K4F. A mock repository served as the data source for K4F. One Akka actor was implemented to act as a pipeline between the data repository and K4F, and another Akka actor was implemented to persist the data from the pipeline into the master and pseudo-master databases.
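The following is a minimal sketch of this two-actor arrangement (the MatchRecord type, actor names, and filtering rule are hypothetical; the Cassandra inserts are elided as comments):

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Hypothetical record type for one ingested football result.
final case class MatchRecord(team: String, opponent: String, outcome: String)

// Pipeline actor: receives records from the repository and forwards valid
// ones to the persister, dropping unrelated or malformed data.
class PipelineActor(persister: ActorRef) extends Actor {
  def receive: Receive = {
    case r: MatchRecord if r.team.nonEmpty && r.opponent.nonEmpty => persister ! r
    case _ => // filtered out by the feeder component
  }
}

// Persister actor: writes each record to the master and pseudo-master databases.
class PersisterActor extends Actor {
  def receive: Receive = {
    case r: MatchRecord =>
      // session.execute(insertIntoMaster(r))
      // session.execute(insertIntoPseudoMaster(r))
  }
}

object FeederApp extends App {
  val system = ActorSystem("k4f-feeder")
  val persister = system.actorOf(Props(new PersisterActor), "persister")
  val pipeline = system.actorOf(Props(new PipelineActor(persister)), "pipeline")
  pipeline ! MatchRecord("Manchester United", "Chelsea", "W")
}
```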

Storage Implementation
The storage component of Kognitor was implemented in K4F using Apache Cassandra [71]. Four tables were created in the master database to handle the storage needs of K4F: team, rating, form, and fixture. The pseudo-master database consists of the same tables as the master database. Tables were also created in the batch-view and realtime-view databases to hold the results of computations by the server component.
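A sketch of what creating the master database tables might look like with the DataStax driver is shown below; the keyspace name, column layouts, and replication settings are illustrative assumptions, as the paper only names the four tables:

```scala
import com.datastax.driver.core.Cluster

object MasterSchema extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  session.execute(
    "CREATE KEYSPACE IF NOT EXISTS k4f_master WITH replication = " +
      "{'class': 'SimpleStrategy', 'replication_factor': 1}")

  // The four tables named in the text: team, rating, form, and fixture.
  session.execute("CREATE TABLE IF NOT EXISTS k4f_master.team (team_id text PRIMARY KEY, name text)")
  session.execute("CREATE TABLE IF NOT EXISTS k4f_master.rating (team_id text, season text, rating int, PRIMARY KEY (team_id, season))")
  session.execute("CREATE TABLE IF NOT EXISTS k4f_master.form (team_id text, match_date date, result text, PRIMARY KEY (team_id, match_date))")
  session.execute("CREATE TABLE IF NOT EXISTS k4f_master.fixture (fixture_id text PRIMARY KEY, home_team text, away_team text, kickoff timestamp)")

  cluster.close()
}
```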

Server Implementation
In K4F, the server component was implemented using Figaro. Figaro represents a probabilistic model using elements (variables), the relationships (dependencies) between these elements, the functional forms of these dependencies, and the numerical parameters of the functional forms. In this case study, four elements were chosen to represent an indication of a win in a football match:

a. hasGoodRating: a Boolean element dependent on a team's rating. The rating takes a value between 0 and 10, with 10 being the highest (best) rating.
b. hasGoodForm: a Boolean element dependent on a team's performance in its last six (6) games.
c. hasHomeGroundAdvantage: a Boolean element dependent on a team's performance at its home ground.
d. isWinner: a Boolean element indicating the possibility of a win.

The relationships between the chosen elements are shown in Figure 2. Next, the functional forms of the dependencies are determined; in Figaro, functional forms are expressed through the element class constructors. The isWinner element is connected to all the other elements in Figure 2, and its functional form is

isWinner = Flip(δ)

where Flip is a construct used in Figaro to denote a Boolean element that is true with a given probability, and δ is the probability of a win. hasGoodRating is defined as

hasGoodRating = (δ → β) ∧ (¬δ → α)

where α represents the bad rating probability, β represents the good rating probability, and δ represents the win probability. The functional form of hasGoodForm is

hasGoodForm = (δ → η) ∧ (¬δ → ζ)

where ζ is the probability of a team's bad form, η is the probability of a team's good form, and δ is the probability of a win. hasHomeGroundAdvantage is defined as

hasHomeGroundAdvantage = (δ → ϑ) ∧ (¬δ → θ)

where θ is the home ground loss probability, ϑ is the home ground win probability, and δ is the probability of a win.
The elements, their relationships, the functional forms of the relationships, and the numerical parameters together form a complete Figaro model for K4F.
This case study uses the expectation maximization (EM) learning algorithm and the variable elimination inference algorithm. Both algorithms are provided by Figaro.
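A minimal Figaro sketch of this model is shown below. The numerical values are hypothetical placeholders for the parameters (δ, α, β, ζ, η, θ, ϑ) learned by EM, and the conditional forms above are expressed with Figaro's If construct; this is an illustration, not the K4F source code:

```scala
import com.cra.figaro.language._
import com.cra.figaro.library.compound.If
import com.cra.figaro.algorithm.factored.VariableElimination

val isWinner = Flip(0.5)                                          // δ
val hasGoodRating = If(isWinner, Flip(0.8), Flip(0.3))            // β / α
val hasGoodForm = If(isWinner, Flip(0.7), Flip(0.4))              // η / ζ
val hasHomeGroundAdvantage = If(isWinner, Flip(0.6), Flip(0.35))  // ϑ / θ

// Reasoning computation: observe evidence about a team, then query the
// posterior probability of a win using variable elimination.
hasGoodForm.observe(true)
hasHomeGroundAdvantage.observe(true)
val ve = VariableElimination(isWinner)
ve.start()
println(ve.probability(isWinner, true))
ve.kill()
```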

Evaluation
Evaluating an artefact is necessary to provide insight into its effectiveness and quality [72,73]. K4F was evaluated using the experimental method.
This experiment used two EPL teams, Manchester United and Chelsea. Their games from the 2017/2018 and 2018/2019 seasons were used in the experiment.

Learning Computation Results
Learning computation was carried out three (3) times on the batch module of the server component, corresponding to the intake of data. On the real-time module, learning was done each time new data was ingested into K4F. Learning computation was repeated at least five (5) times on both the batch and real-time modules to assess its duration. On the first ingestion of data, learning on both the batch and real-time modules took approximately 1.2 seconds (see Table 1). Subsequently, as the size of the data in the master database increased, the time to complete learning on the batch module also increased (see Tables 2 and 3). However, learning time on the real-time module remained roughly constant.

Reasoning Computation Results
In K4F, a reasoning computation request queries the isWinner element. K4F exposes reasoning on the batch module, the real-time module, and a combination of both. Table 4 shows the reasoning times in seconds.

Conclusion
This paper presents a framework called Kognitor that proposes the adoption of probabilistic programming in big data processing. Kognitor also enables cost-effective, low-latency data processing using the Lambda architecture.
This paper started with a discussion of the background of big data processing and an analysis of related work. It then introduced the framework, along with an implementation to showcase its effectiveness. Kognitor was evaluated using the experimental method on a case study (K4F), and the performance results from this evaluation show low latency in data computation.
The focus of this paper is probabilistic programming in big data computation; thus, less effort was directed toward other components such as the user experience (UX), which may form part of future work. Another area for future work is further evaluation of the Kognitor framework using other evaluation methods.