Survey of Big Data Storage Technology
Wang Weichen, Gao Jing, Cao Rui
College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
To cite this article:
Wang Weichen, Gao Jing, Cao Rui. Survey of Big Data Storage Technology. Internet of Things and Cloud Computing. Vol. 4, No. 3, 2016, pp. 28-33. doi: 10.11648/j.iotcc.20160403.13
Received: April 27, 2016; Accepted: June 4, 2016; Published: June 21, 2016
Abstract: Big data storage is the foundation of big data processing and analysis. This paper surveys the main data storage technologies from four aspects: distributed file systems, NoSQL databases, database appliances (all-in-one machines), and new data storage technology based on the MPP architecture. It also gives recommendations for different environments, so that the state of development of data storage technology can be understood from several angles. For distributed file systems, the paper summarizes file segmentation, suitable scenarios, and advantages and disadvantages. For NoSQL databases, it analyzes the principles and suitable scenarios of the four data storage models. It then reviews the development and features of database appliances in detail and outlines the MPP (Massively Parallel Processing) architecture, a new data storage technology. Finally, it discusses research trends in storage technology as a reference for further research on big data storage.
Keywords: Big Data Storage, NoSQL, Distributed File System, Database All-in-One Machine, MPP Architecture
In the past few decades, as applications grew in scale, web services evolved from a single form into multimedia forms, which led to diverse data structures and formats and to exponential data growth. International Data Corporation (IDC) predicts that data size will double every two years. McKinsey & Company, a consulting company in the United States and a pioneer of big data research, defines big data as a data set whose scale is beyond the ability of conventional database tools to acquire, store, manage, and analyze. Traditional data storage systems have reached a bottleneck and cannot finish data processing in time. Big data is characterized by high volume, varied data types, low value density, high processing speed, and complex, dynamic relationships between data items. These features, together with requirements for high availability, scalability, and reliability, pose challenges to traditional data storage technology.
This paper surveys new data storage technologies that address application scale expansion, rapid data growth, and multiple data types. First, it reviews the research status of distributed file systems and compares the characteristics of the main ones, such as GFS, HDFS, GlusterFS, GridFS, TFS, Lustre, and FastDFS. Second, it examines the four data models of NoSQL databases and contrasts the storage technology of the key-value, column, document, and graph models. Third, it studies the applicability and architectural features of current big data all-in-one machines. Finally, it introduces the definition and application scenarios of new database clusters built on the MPP (Massively Parallel Processing) architecture and forecasts probable research trends in storage technology.
2. Big Data Storage Technology
This paper investigates and analyzes big data storage technology from four aspects: distributed file systems, NoSQL databases, database all-in-one machines, and new data storage technology based on the MPP architecture. It also gives recommendations for different environments, so that the state of development of data storage technology can be understood from several angles.
2.1. Distributed File System
The file system is the foundation of application programs. With the development of network applications, however, data grows rapidly, and big data storage has become a central task for enterprises and research institutions. Because their storage capacity is limited, traditional storage systems can hardly solve the problem of big data storage. A distributed file system is therefore used to spread the system load across multiple nodes. It provides aggregate storage capacity and I/O bandwidth, so the system can be scaled easily.
In general, the success of a distributed file system depends on three factors: data storage mode, read rate, and security mechanism. Distributed file systems still have room for improvement. For example, GFS and HDFS, which are designed for large files, cannot meet the storage requirements of large numbers of small files. Small files are accessed frequently, which leads to frequent hard disk access and reduces I/O performance. Small files also produce a large amount of metadata, which burdens metadata server management and recoverability and lowers overall performance. Moreover, because the files are relatively small, they tend to cause file fragmentation and waste disk space. Creating a connection for each file also causes network delay.
Table 1 summarizes and compares several common distributed file systems, including GFS, HDFS, TFS, and Lustre.
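As a rough illustration of the metadata pressure described above, the following sketch counts metadata records under an assumed model in which the metadata server keeps one record per file plus one record per block. The model, the function name `metadata_entries`, and the 4 KiB file size are assumptions chosen for illustration; the 64 MB block size is the HDFS default quoted in this paper. Real systems keep richer metadata, so the numbers indicate only the trend.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default block size quoted in this paper


def metadata_entries(file_sizes, block_size=BLOCK_SIZE):
    """Count metadata records: one per file plus one per block (assumed model)."""
    entries = 0
    for size in file_sizes:
        # Even a tiny file occupies at least one block record.
        blocks = max(1, math.ceil(size / block_size))
        entries += 1 + blocks
    return entries


one_gib = 1024 ** 3
small_file = 4 * 1024  # hypothetical 4 KiB small file

large = [one_gib]                               # 1 GiB stored as a single file
small = [small_file] * (one_gib // small_file)  # the same 1 GiB as 4 KiB files

large_entries = metadata_entries(large)
small_entries = metadata_entries(small)
print(large_entries, small_entries)
```

Under this simplified model, the same 1 GiB of data needs 17 records as a single file but 524,288 records as 4 KiB files, which is why small files strain the metadata server.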
|Name||File Segmentation||System Backup||Merit||Demerit||Application Scenarios|
|GFS||Files stored in GFS are divided into fixed-size blocks.||Each block is copied to multiple chunk servers; 3 copies are kept by default.||GFS treats hardware failures as normal rather than exceptional. An update is usually handled by appending new data rather than changing existing data.||Not suited to small-file storage. Very small files degrade performance.||Large distributed data sets. Data size is generally 4 GB to 40 GB.|
|HDFS||Large files are divided into blocks whose default size is 64 MB. Each block is stored as several copies on more than one data node.||Supports data replication. Multiple copies are stored on different nodes.||Highly scalable. A single HDFS instance can support tens of millions of files. Provides high-throughput data access.||Cannot be used in scenarios that require low-latency data access. Cannot store large numbers of small files efficiently.||Very large data sets, from GB to TB in size.|
|GlusterFS||Does not support file segmentation.||Supports data replication and provides a global namespace. Multiple copies of multiple files can be stored on different hosts. When reading, the system chooses the closest copy by default.||Supports CIFS, NFS, and native access through GlusterFS clients. Multiple file systems can be deployed on the virtual distributed file system. Central updates are easy to manage through the admin console. All nodes can be used to retrieve data.||The system can be managed from only one server, with no redundancy. A new node cannot be added partway through operation. It is unclear how to add multiple disks to each node. GlusterFS has no security policy.||Massive large-file storage at PB level.|
|TFS||Does not support file segmentation. A large number of small files are merged into one large file.||TFS stores data files in blocks and keeps multiple copies to ensure data safety.||Simple to operate, with smooth expansion and load balancing. Supports linear scaling and can easily extend to PB level.||When concurrency is high and file size exceeds 5 MB, severe bugs arise in TFS. Large-file storage is supported only in rare cases. Directories and user permissions are not supported.||Storage and processing of vast amounts of unstructured data, such as the massive image collection of the Taobao website.|
|GridFS||Supports dividing a large file into multiple small documents.||GridFS stores file data and file metadata in MongoDB. Files are replicated for failover and data integration. Replicas can also be used for read scaling, hot backup, or as data sources for offline batch processing.||Based on an object storage structure, which further reduces access overhead at runtime. GridFS is designed to regulate the access mechanism so that it suits the faster application performance demanded by the I/O pattern. It can provide the fastest I/O performance that an application cluster needs. GridFS balances load across storage devices rather than filling a single node.||Reading a file from GridFS is slower than reading it directly from the file system. If a large file is stored as multiple chunks, not all chunks can be locked while the file is modified. To change a document stored in GridFS, the old document must be removed first and the new one saved again.||Large files that seldom need to be changed.|
|Lustre||Divides data into a fixed number of objects, each containing several data blocks. When the data written to an object exceeds its capacity, the next write goes to the next object. Lustre can distribute a file over at most 160 objects.||Lustre provides two backup tools. One scans the file system, and the other packages backups and restores them from compressed archives.||Provides data sharing and parallel processing. Highly scalable. Provides failover for the metadata and object data managed by Lustre, achieving highly reliable access. The distributed management mechanism achieves concurrency control. Provides access over multiple network protocols.||Data mirroring is difficult to implement. Failover between nodes relies on third-party heartbeat technology. There are only two metadata management nodes, so once the system reaches a certain scale the management node becomes overloaded. The Lustre kernel can be deployed only on Linux, with some limitations.||Massive large-file storage at PB level; suitable for large computer clusters and supercomputers.|
|Ceph||Adopts a RAID 0 pattern across multiple hard disks. Continuous data is dispersed over multiple disks for access, which supports load balancing.||Supports data replication. There are multiple metadata servers.||Stores data and metadata separately. Manages metadata with dynamic distribution. Provides reliable, automatic, distributed object storage.||The technology is not yet mature and may not be suitable for production environments.||Massive large-file storage at PB level.|
|FastDFS||FastDFS does not store files in blocks. Each file uploaded by a client corresponds to one file stored on the server.||FastDFS adopts grouped storage. The storage servers within the same group back each other up.||The FastDFS server has only two roles, tracker and storage node, so it is lightweight. Grouped storage is flexible and easy to control. Every node is a primary node in a peer-to-peer structure. The number of trackers can be changed at any time according to server load.||Because FastDFS does not store files in blocks, it is not suitable for distributed computing scenarios. Storage capacity is limited by a single storage server.||High-traffic services that use files as the carrier, such as photo album and video websites, and vast numbers of small files of 4 KB to 500 MB.|
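The fixed-size segmentation and multi-copy backup listed for GFS and HDFS in Table 1 can be sketched as follows. This is a minimal illustration, not the placement algorithm of either system: the function names, the node names, and the round-robin placement policy are assumptions, and production systems also take racks, load, and free space into account.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover the file with fixed-size blocks."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, nodes, replicas=3):
    """Assign every block to `replicas` distinct nodes (illustrative round-robin policy)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        # Consecutive node indices modulo the node count are always distinct
        # as long as replicas <= len(nodes).
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replicas)]
    return placement


# A 200-byte file with 64-byte blocks stands in for a large file with 64 MB blocks.
blocks = split_into_blocks(200, 64)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
print(blocks)
print(placement)
```

The last block is shorter than the others, and every block remains readable if any single node fails, because its three copies sit on three different nodes.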
2.2. NoSQL Database
With the rapid growth of enterprise data and rising user expectations of service levels, the traditional relational database shows several limitations. Traditional databases store all application data in flat files based on structured records, which leads to a mismatch between the application and the database. The mismatch arises when the application is written in a declarative language whose structure is completely different from that of these databases. In the past, systems were scaled vertically, by extending components and resources, to improve performance. However, because storage and application are no longer separate, each resource expansion causes a service interruption and an application reset.
Most of the data we create is heterogeneous. The coexistence of large amounts of structured and unstructured data makes it difficult to define a complete, unified relational data model in advance, and relational databases scale out poorly. Most relational databases do not support large-scale distributed storage, and they struggle to meet real-time requirements under high concurrency and large data volumes. The underlying storage technology should therefore be flexible enough to store data in its natural form while still meeting the demands of front-end applications.
Compared with relational databases, NoSQL storage systems support the storage and dynamic management of massive data. They avoid unnecessary complexity, offer high throughput, scale horizontally, and are highly fault tolerant. They can store structured, semi-structured, and unstructured data, which avoids object-relational mapping. The design idea of a NoSQL database is to extract the indexing mechanism of the relational database, combine it with a distributed storage strategy, and drop the parts of the SQL system that are unnecessary for certain problems. As a result it achieves relatively good efficiency, scalability, and flexibility.
NoSQL databases are mainly divided into four types: key-value storage, column-based storage, document storage, and graph storage.
A key-value database is designed to support simple query operations and leaves complex operations to the application layer. The data set maps each key to one value or a set of values. The key is the unique identifier used to locate each data item, which makes it indispensable, and the value is the content actually stored. Key-value storage provides a hash table of key-value pairs on the remote servers of a distributed cluster to implement the mapping from key to value. The hash of the key locates the data address directly, which enables fast queries under high concurrency and supports operations on massive data. Key-value stores are divided into key-value, key-document, and key-column types. The key-column type is the typical extension of plain key-value pairs. Because of its simplicity and flexible extensibility, it has also become a mainstream data model.
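The idea that the hash of the key locates the data directly can be illustrated with a small in-memory sketch. The class `PartitionedKVStore` and its methods are hypothetical names, the "nodes" are plain dictionaries standing in for remote servers, and simple modulo partitioning is an assumption made for brevity; real key-value stores typically use more elaborate schemes such as consistent hashing.

```python
import hashlib


class PartitionedKVStore:
    """Toy key-value store that spreads keys over several in-memory 'nodes'."""

    def __init__(self, num_nodes):
        # Each dict stands in for the hash table held by one remote server.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash of the key decides which node owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        # Only the owning node is consulted; no other node is scanned.
        return self.nodes[self._node_for(key)].get(key, default)


store = PartitionedKVStore(num_nodes=4)
store.put("user:1", {"name": "alice"})
store.put("user:2", {"name": "bob"})
print(store.get("user:1"), store.get("user:3"))
```

A lookup touches exactly one node, which is what allows this model to serve simple queries quickly under high concurrency; anything more complex than get and put is left to the application layer.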
Generalized column-based storage replaces columns with column families. A relational database stores each table row by row on disk; that is, the entries associated with the same row id are stored together. Because organizations such as banks and financial institutions must maintain large numbers of related records, there is no guarantee that all values are stored contiguously. In a column-oriented database, a whole column of a table is stored together and mapped to a key. Because all columns are indexed, a query needs to search only part of the table. A column can also contain nested columns in a hierarchical structure; one such structure is the super column. This provides simple queries and quick access while avoiding the unnecessary overhead of looking up the single key of each record.
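The difference between the row layout and the column layout can be shown with a small sketch. The sample records, the field names, and the helper `to_columns` are invented for illustration and do not correspond to any particular database.

```python
# Row-oriented layout: all fields of one record are kept together.
rows = [
    {"id": 1, "name": "alice", "balance": 120},
    {"id": 2, "name": "bob", "balance": 80},
    {"id": 3, "name": "carol", "balance": 200},
]


def to_columns(records):
    """Regroup row records so that each column's values are stored together."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


# Column-oriented layout: each column name maps to the list of its values.
columns = to_columns(rows)

# An aggregate over one column reads only that column, not the whole table.
total_balance = sum(columns["balance"])
print(columns["balance"], total_balance)
```

Summing the balances touches a single list in the column layout, whereas the row layout would require reading every field of every record.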
Graph databases are best suited to traversal and search applications, such as finding related connections on LinkedIn or finding friends on Facebook. They focus on the relationships between data items rather than on the data itself. They are highly optimized for rapid traversal and make efficient use of graph algorithms, for example shortest-path search to find the relevance between pieces of information.
Table 2 summarizes the main storage types of NoSQL databases:
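A shortest-path traversal of the kind mentioned above can be sketched with breadth-first search over an adjacency list. The small friendship graph and the function name `shortest_path` are assumptions for illustration; a graph database would run such traversals over its own storage and indexes.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return a path with the fewest hops, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # maps each visited node to the node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical friendship graph stored as an adjacency list.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}
path = shortest_path(friends, "ann", "eve")
print(path)
```

The length of the returned path gives a simple measure of how closely two people are related, which is the kind of question a social networking application asks.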
|Database Type||Merit||Demerit||Data Model||Application Scenarios||Instances|
|Key-value storage||Very high concurrent read and write performance. Data is indexed and partitioned by key. Search is fast and the data model is simple.||Data has no structure, and complex logical data operations are not supported.||Mapping from key to value||Content caching. Mainly used for log systems.||Dynamo, Redis, Voldemort|
|Column-based storage||Search is fast, scalability is good, and many I/O operations are saved. Distributed extension is easier.||Functionality is relatively limited.||Column-based storage, where data in the same column is stored on the same page||Distributed file system||Bigtable, Cassandra, HBase, HyperTable|
|Document storage||The data structure does not need to be defined in advance. A document in a specific format, rather than a tuple, is the unit of data storage.||Query efficiency is not high, and there is no unified query syntax.||The value points to structured data.||Web applications||CouchDB, MongoDB, XML Database, ThruDB|
|Graph storage||Uses graph theory and associated algorithms to improve the performance of storing, managing, and operating on data.||Functionality is relatively limited.||Graph structure||Social networking, relationship graphs||Neo4j, GraphDB, InfoGrid|
2.3. Database All-in-One Machine
In recent years, faced with the processing and storage of massive data, many traditional hardware manufacturers have proposed an integrated solution, the database all-in-one machine, which has become a hot topic. The all-in-one product form reduces the complexity of deploying and managing data center infrastructure and addresses the continuous expansion of basic hardware resources in the age of big data, the demand for integrated machines, and the cost of storing massive data. International manufacturers such as IBM, Oracle, and EMC have launched integrated products and solutions for big data. Chinese manufacturers have followed with their own database all-in-one machines. For example, Huawei's database all-in-one machine exploits a hardware architecture that converges computing, storage, and networking, offers high throughput and high IOPS, and integrates intelligent network cards, SSDs, and other hardware to remove the performance bottleneck between computing and storage. The XData big data processor from Shuguang separates the data storage unit from the processing unit; through efficient service middleware, it aggregates the underlying shared-nothing data storage nodes into a single data processing system image. The Langchao clouds big data all-in-one machine covers technical stages such as data storage, data processing, and data presentation. Other products include the Yunchuang Storage data cube cloud computing all-in-one machine, the Zhongzhiheda big data all-in-one machine, and the Zhiyitu Hadoop-based big data all-in-one machine.
The database all-in-one machine is generally suitable for data models with complex storage relations whose computation demands strong transactional guarantees and consistency. Generally speaking, the configuration of the database engine servers depends on concurrency demands, and the configuration of the database storage node servers depends on data size. The database all-in-one machine adopts a fully distributed big data processing architecture that integrates hardware and software in one system. As user data and business grow, it can be scaled vertically by upgrading hardware and scaled linearly in the horizontal direction by adding nodes, which preserves low latency, high throughput, and business continuity. The all-in-one machine is a combination of software and hardware designed as a whole for the storage and processing of massive data. It consists of a set of integrated servers, storage devices, operating systems, a database management system, and pre-installed data management software. It provides a big data storage solution, mainly for the large data warehouse market, and its high throughput helps to remove the I/O bottleneck. Users can choose among product series according to their requirements and customize on demand.
However, the database all-in-one machine also faces challenges. In the era of big data, the amount of data is growing at a startling rate. Users who need to expand an all-in-one machine can only add an equipment cabinet, which makes expansion inflexible. And because the all-in-one software is highly integrated, it is hard to deploy in other environments.
In some industries demand changes quickly, and the business model changes with it. In such cases an all-in-one machine can restrict an enterprise's freedom of action. In relatively mature and stable applications, however, the all-in-one machine shows its value in simplifying IT.
2.4. New Database Cluster of MPP Architecture
MPP (Massively Parallel Processing) is a large-scale parallel processing system and a method of extending system resources that is based mainly on parallel processing. It means that a single system contains multiple processors connected by a network.
Horizontal scaling is the primary design goal of an MPP database. An MPP system links multiple SMP servers through a node interconnect network so that they collaborate on common tasks. From the user's point of view it is a single server system that supports a strict relational data model. Its most distinctive characteristic is that each node can access only its own local resources, with nothing shared.
A database cluster using the MPP architecture can effectively support the storage of massive structured data at PB level. It is based on the shared-nothing architecture. By applying big data processing techniques such as column storage and coarse-grained indexes, and combining them with a high-performance distributed computing model, it provides technical support for analytical storage applications. It mostly runs on low-cost PC servers and offers high performance and high scalability. It can improve data processing performance, increase the data processing load that can be handled, raise the efficiency of massive data processing, and reduce the overall cost of processing each TB. It has therefore been widely used in new-generation enterprise data warehouses and in structured data analysis.
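The shared-nothing processing model can be illustrated by a sketch in which rows are hash-distributed over nodes, each node aggregates only its local rows, and a coordinator merges the partial results. All names here (`distribute`, `local_sum`, `mpp_sum`, the sales records) are hypothetical, the nodes are plain Python lists processed one after another, and no real MPP database is being modeled.

```python
import zlib
from collections import Counter


def distribute(rows, num_nodes, dist_key):
    """Hash-distribute rows so that each row lives on exactly one node."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        # crc32 is used as a stable hash so the placement is repeatable.
        bucket = zlib.crc32(str(row[dist_key]).encode("utf-8")) % num_nodes
        nodes[bucket].append(row)
    return nodes


def local_sum(node_rows, group_key, value_key):
    """Partial aggregate computed by one node over its own local rows only."""
    partial = Counter()
    for row in node_rows:
        partial[row[group_key]] += row[value_key]
    return partial


def mpp_sum(rows, num_nodes, dist_key, group_key, value_key):
    """Coordinator: scatter the rows, gather per-node partial sums, merge them."""
    nodes = distribute(rows, num_nodes, dist_key)
    total = Counter()
    for node_rows in nodes:  # in a real cluster these would run in parallel
        total.update(local_sum(node_rows, group_key, value_key))
    return dict(total)


sales = [
    {"order_id": 1, "region": "north", "amount": 10},
    {"order_id": 2, "region": "south", "amount": 5},
    {"order_id": 3, "region": "north", "amount": 20},
    {"order_id": 4, "region": "south", "amount": 20},
]
totals = mpp_sum(sales, num_nodes=3, dist_key="order_id",
                 group_key="region", value_key="amount")
print(totals)
```

Because each node works only on its own rows, adding nodes divides the work without introducing contention for shared resources; only the small partial results travel to the coordinator.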
3. The Prospect of Big Data Storage Technology
Because of the large amount of unstructured and semi-structured data, the traditional relational database is no longer adequate. New storage technologies such as NoSQL databases and distributed file systems are superior to traditional storage in fault tolerance, scalability, and data mobility, and they are suitable for persistent storage and for managing massive data. In real-time data processing, however, a gap remains between the new storage technologies and relational databases, so each has its strengths. At present, combining relational databases with distributed parallel processing systems can improve storage efficiency, processing speed, and analysis speed, and this approach is likely to remain a major trend. The core problem of big data storage technology is performance. A single technique or platform can no longer meet the explosive growth of data or operators' requirements for data analysis and storage. In subsequent development, new databases will gradually be combined with the Hadoop or Spark ecosystem: the database provides SQL and transaction support for applications, while Hadoop or Spark processes semi-structured and unstructured data. Storage will therefore develop toward a combination of MPP parallel database clusters and Hadoop/Spark clusters. In addition, with the explosive growth of enterprise data, the big data all-in-one machine is expected to become a hot technology and to be widely used.
By studying new data storage technologies, this paper summarizes and contrasts in detail, from different angles, distributed file systems, NoSQL databases, database all-in-one machines, and new database clusters based on the MPP architecture, and it puts forward future research trends. Big data storage is still developing rapidly. There is considerable room for further development, and continued exploration by researchers is needed.
Fund Project: National Natural Science Foundation of China (61462070); Regional science and technology planning project (20130364).