Distributed storage systems are the solution for storing and managing data that no longer fits on a conventional server. Size is not the only problem: classic file systems, with their folder structure, do not handle unstructured data well either.
In big data projects, the amount of data that will have to be managed is usually not known at the outset. Systems must therefore be easy to expand while they keep running, with additional servers that integrate seamlessly into the existing storage system. A distributed file system appears to the user as an ordinary folder in a traditional file system, so the user does not notice that individual files, or even parts of them, may reside on different servers, possibly in locations far apart from one another. Since both GlusterFS and Ceph run as software layers on Linux operating systems, they require no special hardware. Linux runs on any standard server and is compatible with all common hard drives on the market.
High availability is an important issue in distributed storage solutions: hardware failures should be avoided as far as possible, and the software that runs the system should not have to be interrupted when new components are added or when maintenance work is necessary. Important metadata must not be stored in a single central location but must be accessible in a decentralized way, and no item should be left without redundancy. The failure of one server must never compromise the entire system. GlusterFS and Ceph are both systems for hosting the data of big data projects in a single system and filtering it from there. Both can be expanded almost at will, but they are based on different approaches.
The term big data refers to very large, complex, and loosely structured masses of data, such as those collected by certain sensors for scientific purposes (GPS satellites, for example) or by meteorological or statistical systems. In the field of big data, efficient search and systematic organization of the data play a key role alongside storage.
GlusterFS is a distributed file system with a modular structure, in which several servers are connected to one another over a TCP/IP network. Because it is POSIX (Portable Operating System Interface) compliant, GlusterFS integrates easily into Linux server environments, as well as into FreeBSD, OpenSolaris, and macOS environments, which are also POSIX compliant. Integration into Windows environments, however, is currently possible only indirectly, through a Linux server acting as a gateway.
GlusterFS started out as a classic, file-based storage system. It later became object-oriented, and during that change particular importance was placed on integrating it well with the well-known open source solution OpenStack. In the background, GlusterFS continues to work with files: each file is assigned an object, and the connection between the two is established by hard links in the file system. The user does not see any dedicated server but uses their own interfaces to save data in GlusterFS, which presents itself as a single system.
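The way a hard link can tie an object identifier to an ordinary file can be sketched in a few lines of Python. This is only an illustration of the mechanism, not GlusterFS's actual on-disk layout: the `.objects` directory, the random identifier, and the function name `store_with_object_link` are assumptions made for the example.

```python
import os
import tempfile
import uuid


def store_with_object_link(root, relative_path, data):
    """Write a file, then hard-link it under a generated object ID.

    The ".objects" directory and the ID format are hypothetical.
    """
    file_path = os.path.join(root, relative_path)
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    with open(file_path, "wb") as handle:
        handle.write(data)

    object_dir = os.path.join(root, ".objects")  # hypothetical location
    os.makedirs(object_dir, exist_ok=True)
    object_path = os.path.join(object_dir, uuid.uuid4().hex)

    # A hard link is a second directory entry for the same inode,
    # so no data is copied and both names stay in sync.
    os.link(file_path, object_path)
    return file_path, object_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        file_path, object_path = store_with_object_link(
            root, "docs/report.txt", b"hello"
        )
        print(os.path.samefile(file_path, object_path))
        print(os.stat(file_path).st_nlink)
```

Because both names point at the same inode, the data can be reached through the familiar path or through the object name, and it is removed only when the last link is deleted.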
Ceph is an open source distributed storage solution: an object-oriented store based on binary objects, which avoids the rigid block structures of conventional storage media. In terms of hardware, Ceph also uses hard drives, but an algorithm manages the binary objects, which are split into many parts, spread over many servers, and reassembled when needed.
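The idea of splitting an object into parts, spreading them over servers by computation rather than by a central lookup table, and putting them back together can be sketched as follows. This is a simplified stand-in, not Ceph's real placement algorithm (CRUSH), which is considerably more elaborate: the example uses plain rendezvous hashing, and the server names, the tiny chunk size, and all function names are assumptions chosen for illustration.

```python
import hashlib


def split_object(data, chunk_size):
    """Cut a binary object into fixed-size parts (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunk(chunk_key, servers, replicas):
    """Pick `replicas` distinct servers for a chunk via rendezvous hashing."""
    def score(server):
        digest = hashlib.sha256(f"{server}:{chunk_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(servers, key=score, reverse=True)[:replicas]


def store_object(name, data, servers, chunk_size=4, replicas=2):
    """Store every chunk on its computed servers; return cluster and count."""
    cluster = {server: {} for server in servers}
    chunks = split_object(data, chunk_size)
    for index, chunk in enumerate(chunks):
        key = f"{name}.{index}"
        for server in place_chunk(key, servers, replicas):
            cluster[server][key] = chunk
    return cluster, len(chunks)


def read_object(name, chunk_count, cluster, servers, replicas=2, failed=()):
    """Reassemble the object, skipping servers listed in `failed`."""
    parts = []
    for index in range(chunk_count):
        key = f"{name}.{index}"
        for server in place_chunk(key, servers, replicas):
            if server not in failed and key in cluster[server]:
                parts.append(cluster[server][key])
                break
        else:
            raise IOError(f"no surviving replica for {key}")
    return b"".join(parts)


if __name__ == "__main__":
    servers = ["osd.0", "osd.1", "osd.2", "osd.3"]  # hypothetical names
    data = b"distributed storage"
    cluster, count = store_object("obj", data, servers)
    print(read_object("obj", count, cluster, servers, failed=("osd.1",)))
```

Every reader recomputes the placement from the chunk key and the server list, so no central index has to be consulted, and with two replicas per chunk the object remains readable after any single server fails.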
All components work in a decentralized manner, and all OSDs (Object-Based Storage Devices) have equal rights. In this way, any number of servers, each with its own hard drives, can be connected to form a unified storage system. Ceph offers three important interfaces for integrating it into an existing system environment: CephFS, a Linux file system driver; RADOS Block Devices (RBD), which can be integrated directly as Linux block devices; and the RADOS Gateway, which is compatible with Swift and Amazon S3.
Since there are several technical differences between GlusterFS and Ceph, there is no clear winner. Ceph is in principle an object-based storage system for unstructured data, while GlusterFS uses tree-shaped file systems on block-based devices. GlusterFS originates from a highly efficient, file-based storage system but is increasingly developing in an object-oriented direction. Ceph, by contrast, was originally developed as binary object storage, not as a classic file system, which can lead to weaknesses in operations typical of traditional file systems.
Thanks to its varied interfaces, Ceph works well in heterogeneous networks in which not only Linux but also other operating systems are used. The strong point of GlusterFS, on the other hand, is the storage of large amounts of data in traditional file format, as well as of large files. Because Ceph was developed as an open source solution from the outset, it used to be the easier choice in many cases, until GlusterFS became open source as well. A very relevant area of application for distributed storage systems is cloud services. Here, OpenStack is one of the most important software projects offering architectures for cloud computing. Both GlusterFS and Ceph work equally well with OpenStack.