Tuesday, January 10, 2017

MongoDB Vs Cassandra

Comparison of both databases in terms of security

1.     Introduction

Both databases are open source: MongoDB is document oriented, while Cassandra is aimed at very large datasets. Both belong to the NoSQL family. NoSQL databases are designed mainly for scalability, fast storage, fast access to data and security (Anon., n.d.). They can run across a large number of nodes and offer features that were not practical with an RDBMS. Reads and writes are designed to proceed at the same time without conflict. The data is distributed over clusters that can span thousands of machines and is accessed through nodes or routers. This paper compares the two databases in terms of performance, storage, retrieval time, scalability, reliability and security. Their data models differ: MongoDB is a document store and Cassandra is a wide column store. Cassandra was released as open source in 2008 and is now a project of the Apache Software Foundation, while MongoDB was developed by MongoDB Inc. Cassandra is written in Java and MongoDB in C++ (Anon., n.d.). Both databases are schema free. Cassandra has no server-side scripting, whereas MongoDB uses JavaScript on the server side.
The three CAP properties (consistency, availability and partition tolerance) cannot all be guaranteed at once. MongoDB follows CP, whereas Cassandra follows AP. CP means that during a network partition some data may be unavailable, but the data that is returned is consistent; AP means that a response is always returned, but some of the returned data may be out of date. Cassandra is mostly applied to IoT, recommendation engines, fraud detection applications, playlists, product catalogs and messaging applications. It belongs to the scalability-focused class of NoSQL databases (Bushik, 2012). MongoDB, in turn, helps businesses transform themselves by harnessing the power of the data they store. It is used by organizations ranging from startups to large companies to build applications that perform complex tasks. Cassandra requires less administration than MongoDB. This report presents the main aspects of both databases and compares them.

2.     MongoDB

MongoDB can run as a single standalone instance. High availability is provided by replica sets, which handle failures (MongoDB, n.d.). A sharded cluster divides a large data set and stores it across different machines. Combining replica sets with sharded clusters provides high redundancy, and the distribution of the data is transparent to applications. The main features of MongoDB are given below:
·        Fast, iterative development.
·        Flexible data model.
·        Multi-datacenter scalability.
·        Integrated feature set.
·        Lower total cost of ownership (TCO).
·        Long-term commitment.
·        Flexibility.

Data Management for MongoDB

Linear scalability
MongoDB provides cost-efficient horizontal scale-out through sharding. This process is transparent to software applications. Sharding distributes the data across multiple partitions, known as shards. Deploying MongoDB in this pattern removes the bottleneck of a single server (Ellis, 2009) and reduces complexity. As the data grows, it is spread over the cluster and the size of the cluster is increased. This whole process is maintained automatically, unlike in many other databases, so the application developer does not have to write any sharding logic. MongoDB also supports several sharding policies, which makes it easy for developers to distribute data across the resources of a cluster. These policies allow high scalability for different workloads and are given below:
Range sharding
MongoDB mainly stores documents, and these documents are partitioned across shards according to the value of the shard key. Documents whose shard key values are close to each other are very likely to be stored close together in the cluster, often on the same shard.
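The idea of range-based placement can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the boundaries, the shard names and the function `shard_for` are hypothetical and are not part of MongoDB's API.

```python
from bisect import bisect_right

# Hypothetical split points on the shard key. Chunk 0 holds keys below 100,
# chunk 1 holds [100, 200), chunk 2 holds [200, 300), chunk 3 holds 300 and up.
BOUNDARIES = [100, 200, 300]

# Hypothetical assignment of the four chunks to three shards.
CHUNK_TO_SHARD = ["shardA", "shardA", "shardB", "shardC"]


def shard_for(key: int) -> str:
    """Return the shard that owns the chunk containing `key`."""
    chunk = bisect_right(BOUNDARIES, key)
    return CHUNK_TO_SHARD[chunk]
```

Neighbouring keys such as 150 and 151 fall into the same chunk, which is why range queries on the shard key can often be answered by a single shard.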
Hashed sharding
Documents are distributed according to an MD5 hash of the shard key value. The hash is used for placement, not for encryption, and it helps the data to be distributed evenly across the shards (Gajendran, 2012).
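A minimal sketch of hash-based placement, assuming string keys and three hypothetical shards. The modulo step is a simplification chosen for brevity; a real cluster assigns ranges of hash values to shards rather than taking a remainder, and the names `hashed_key` and `shard_for_hashed` are illustrative only.

```python
import hashlib

SHARDS = ["shardA", "shardB", "shardC"]  # hypothetical shard names


def hashed_key(key: str) -> int:
    """Turn a shard key value into a 64-bit integer using MD5."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for_hashed(key: str) -> str:
    """Pick a shard from the hash (simplified: modulo, not hash ranges)."""
    return SHARDS[hashed_key(key) % len(SHARDS)]
```

Because the hash scatters neighbouring keys, sequential values such as "user-1", "user-2", "user-3" tend to land on different shards, which spreads write load but makes range queries touch more shards.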
Zone sharding
This lets administrators define their own rules for data placement in a sharded cluster by assigning ranges of shard key values to zones. Administrators can continue to refine the data placement and change the shard key ranges, which triggers data migration (Hoberman, 2014).
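Zone rules can be pictured as a lookup from key ranges to named groups of shards. The sketch below assumes a numeric shard key (a year) and two hypothetical zones; the zone names, shard names and helper functions are invented for illustration.

```python
# Hypothetical zones: (zone name, inclusive lower bound, exclusive upper bound).
ZONE_RANGES = [
    ("archive", 0, 2015),
    ("recent", 2015, 2100),
]

# Hypothetical shards that belong to each zone.
ZONE_SHARDS = {
    "archive": ["shard-slow-1"],
    "recent": ["shard-fast-1", "shard-fast-2"],
}


def zone_for(year: int):
    """Return the zone whose range covers `year`, or None if no zone does."""
    for zone, low, high in ZONE_RANGES:
        if low <= year < high:
            return zone
    return None


def eligible_shards(year: int):
    """Shards allowed to hold the key; any shard if no zone covers it."""
    zone = zone_for(year)
    if zone is None:
        return sorted(s for group in ZONE_SHARDS.values() for s in group)
    return ZONE_SHARDS[zone]
```

Changing an entry in `ZONE_RANGES` corresponds to the administrator redefining a zone, after which the affected data would have to migrate to the newly eligible shards.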

2.1     Architecture of MongoDB

The diagram below gives the model of the MongoDB architecture. It contains application servers, configuration servers and sharded MongoDB instances, each of which is a replica set. The components of a sharded cluster are shards, configuration servers and query routers. The data is stored in shards, each of which is a replica set that provides data consistency and availability (Anon., n.d.). The router in the diagram is the query router; it handles queries and provides the interface to the applications used by clients, so clients reach the data in the shards through the router. The main operation of the router is to direct operations to the appropriate shards and return the results to the clients. A cluster can have several routers, which gives fast access to the data and provides high availability.
The config servers store the metadata of the cluster. This metadata maps the cluster's data set to the shards, and the routers use it to locate particular data in the shards. There are 3 config servers in the sharded cluster, as shown in the diagram.
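How the three components cooperate can be sketched with a toy model. The classes `Shard` and `QueryRouter` and the `chunk_map` structure are hypothetical stand-ins: a dict replaces each replica set, and a list of key ranges replaces the metadata held by the config servers.

```python
class Shard:
    """One shard; a plain dict stands in for the replica set's data."""

    def __init__(self, name):
        self.name = name
        self.docs = {}


class QueryRouter:
    """Routes operations to shards using config-server-style metadata."""

    def __init__(self, chunk_map, shards):
        # chunk_map: list of (low, high, shard_name) half-open key ranges.
        self.chunk_map = chunk_map
        self.shards = {s.name: s for s in shards}

    def _target(self, key):
        for low, high, name in self.chunk_map:
            if low <= key < high:
                return self.shards[name]
        raise KeyError(f"no chunk covers key {key!r}")

    def insert(self, key, doc):
        self._target(key).docs[key] = doc

    def find_one(self, key):
        # Targeted query: only the owning shard is contacted.
        return self._target(key).docs.get(key)

    def count_all(self):
        # Query without a shard key: every shard is asked (scatter-gather).
        return sum(len(s.docs) for s in self.shards.values())
```

The client only ever talks to the router; which shard answers is decided from the metadata.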


Figure 1: Architecture of MongoDB

2.2     Security

Over the last decade there has been a significant increase in hacking and data security issues. It is predicted that by 2021 cybercrime might cost the global economy $6.2 trillion annually. Threats to data security are a constant concern for industry. Data plays a vital role in the growth of a business and in business analysis, and it is the task of administrators to protect all of it from being manipulated or stolen. MongoDB includes security measures for defending itself, controlling access to data and detecting changes in the database (Anon., n.d.). The diagram below gives an overview of these security measures.

Figure 2: MongoDB
External mechanisms can be used to authenticate access to the database. These include LDAP, Kerberos, PKI certificates and Windows Active Directory. The Lightweight Directory Access Protocol (LDAP) is used mostly in business computer networks to access a distributed directory service (Hoberman, 2014). A client that wants to use LDAP must authenticate to the LDAP server and follow the protocol.
Authentication provides a good deal of security, but secure authorization is required as well. In MongoDB the permissions of each user can be set according to the access that user needs. Authorization can also be managed through an LDAP server. Auditing is also provided, and administrators can use the audit log to track access to the database.
Encryption is one of the oldest and most effective measures for data security. MongoDB uses it to encrypt data in transit on the network. There is also a separate storage engine that encrypts and protects data at rest. These built-in features give MongoDB proper management and performance for data access and protection. The encrypted data can only be accessed by authorized users.
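The combination of per-user permissions and auditing described above can be illustrated with a small model. The role names `read` and `readWrite` follow MongoDB's built-in roles, but the user names, the reduced action sets and the function `is_authorized` are assumptions made for this sketch, not MongoDB's implementation.

```python
from datetime import datetime, timezone

# Simplified action sets for two roles (the real roles grant more actions).
ROLE_ACTIONS = {
    "read": {"find"},
    "readWrite": {"find", "insert", "update", "remove"},
}

# Hypothetical users and the roles granted to them.
USER_ROLES = {"alice": ["readWrite"], "bob": ["read"]}

# Every decision is recorded so that access can be tracked later.
audit_log = []


def is_authorized(user: str, action: str) -> bool:
    """Check the action against the user's roles and audit the decision."""
    roles = USER_ROLES.get(user, [])
    allowed = any(action in ROLE_ACTIONS[role] for role in roles)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "allowed": allowed,
    })
    return allowed
```

Denied attempts are logged as well as successful ones, which is what makes the log useful for tracking access.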

3.     Cassandra

Cassandra is a column-oriented database that is distributed, fault tolerant, scalable and high performing (Hewitt, 2010). High availability is difficult to achieve with big data storage, so the data is partitioned and stored in different locations. Cassandra provides this high availability, and its other features are given below:
·        Handles large amounts of data (big data).
·        Access is fast and random.
·        The schema is variable.
·        All nodes can see the same data at the same time.
·        Data is processed and accessed quickly.
·        Data is partitioned and distributed.
·        Availability is higher than in other databases.

Availability, consistency and partition tolerance cannot all be fully achieved at once. Cassandra gives high availability at the expense of consistency. It was developed by Avinash Lakshman to power Facebook's messaging search. In this database every node plays the same role, so there is no single point of failure. As in MongoDB, the data is distributed across a cluster (Ellis, 2009). The replication strategies can be configured flexibly by the administrator as needed. The database is designed as a distributed system so that it can span multiple data centers and a large number of nodes.
It is also designed with disaster recovery in mind. Adding new machines significantly increases read and write throughput. Data is replicated automatically to a number of nodes to provide fault tolerance, which also helps protect data in cloud deployments. Cassandra integrates with Hadoop, including MapReduce support, and is also supported by Apache Hive (Abramova, 2014). Cassandra has its own query language, known as CQL. It is an alternative to SQL and provides a layer that hides details of the database structure. Drivers are available for Java (for example JDBC) and a number of other languages.

3.1     Architecture of Cassandra

The structure of Cassandra comprises the node, cluster, data center, table, commit log, mem-table and bloom filter (Gajendran, 2012). The architecture of Cassandra is described in this section. Before looking at the architecture, it should be understood that Cassandra was developed on the assumption that system failures can and do occur. The distribution is peer-to-peer, and all nodes are the same.
Data is partitioned automatically when it is written to the database. Hence there is no single place where data is written sequentially; it can be anywhere in the cluster. A write first goes to the commit log, and the data is then also written to an in-memory structure, the mem-table (Bushik, 2012). The diagram below shows the architecture of Cassandra: there are two Cassandra clusters, which contain web client access and a number of nodes. The cluster configuration is provided by a middle-tier architecture.
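The write path on a single node (commit log first, then mem-table) can be sketched as follows. The class `NodeSketch`, the tiny flush threshold and the list-based on-disk tables are assumptions for illustration; a real node flushes much larger mem-tables to immutable files on disk.

```python
class NodeSketch:
    """Toy model of one node's write path: commit log, mem-table, flush."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of writes, for recovery
        self.memtable = {}     # in-memory structure holding recent writes
        self.flushed = []      # sorted, immutable tables; newest is last
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Append to the commit log so the write survives a crash.
        self.commit_log.append((key, value))
        # 2. Apply the write to the in-memory mem-table.
        self.memtable[key] = value
        # 3. Flush once the mem-table has grown large enough.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.flushed.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def read(self, key):
        # Newest data wins: check the mem-table, then tables newest first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.flushed):
            for k, v in table:
                if k == key:
                    return v
        return None
```

Writes never update data in place, which is one reason the write path stays fast as the data set grows.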
The architecture of Cassandra also supports replication of data for fault tolerance and efficiency.
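Replication across peer nodes can be illustrated with a simplified token ring. This sketch assumes one token per node and uses MD5 from the standard library as the hash; Cassandra's default partitioner and its virtual nodes work differently in detail, and the names `token` and `Ring` are hypothetical.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Map a node name or partition key to a position on the ring."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Peer-to-peer ring: every node is equal and owns a token range."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = min(replication_factor, len(nodes))
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key: str):
        """First replica owns the key's token; the rest follow clockwise."""
        start = bisect_right(self.tokens, token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.rf)]
```

With a replication factor of 3, any single node can fail and two copies of each row remain reachable.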


Figure 3: Architecture of Cassandra

3.2     Security 

Security of data is of the utmost importance in today's world. Industry always wants data that cannot be manipulated or accessed by third parties. Administrators can create users who are given permission to access the database; the command used is CREATE USER. Cassandra manages users and their passwords internally, in its own cluster database. The query language can also be used to drop such users or alter them as needed (Bushik, 2012). Permission management is under the control of the administrator, who grants users different levels of permission to access data. Cassandra therefore provides a number of security features, which are given below:
3.2.1  Encryption on client to node
This is an extra security option provided by Cassandra. SSL helps ensure that data is not compromised: communication between the client and the cluster is protected using SSL encryption. It is configured independently in Cassandra. For additional security, the settings in the cassandra.yaml file can be overridden at the Java virtual machine (JVM) level, where the configuration and protocol can be changed to suit the organization's security needs. Cassandra uses SSL encryption for client-to-node and node-to-node traffic, with server certificates. The data is protected on the client side using the Secure Sockets Layer, and data transfer within the cluster is protected in the same way. Certificates have to be generated for all of these protections.
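On the client side, the encrypted connection starts from a TLS context that verifies the server's certificate. The following is a minimal sketch using Python's standard `ssl` module; the function name and its parameters are hypothetical, and how the resulting context is handed to a particular Cassandra driver is not shown here.

```python
import ssl


def build_client_tls_context(ca_file=None, client_cert=None, client_key=None):
    """Create a TLS context for a client connecting to an encrypted node.

    ca_file:      path to the CA certificate that signed the node
                  certificates (None falls back to the system trust store).
    client_cert / client_key: optional client certificate and key, used
                  only if the cluster also requires client authentication.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocols
    if client_cert is not None:
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx
```

The default context requires a valid server certificate and checks the host name, so a client built this way refuses to talk to a node it cannot verify.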
3.2.2  Authentication
Cassandra also supports authentication, which is pluggable. The authenticator setting in the cassandra.yaml file lets administrators enable this feature. By default it is set to AllowAllAuthenticator, which performs no authentication and does not require credentials. There is also PasswordAuthenticator, the standard password-based option, which stores credentials in hashed form rather than as plain text (Hewitt, 2010).
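The difference between the two authenticators can be sketched as two interchangeable classes. This is an illustration only: the class names are hypothetical, and PBKDF2 from Python's standard library is used here as a stand-in for whatever password hashing scheme the database applies internally.

```python
import hashlib
import hmac
import os


class AllowAllSketch:
    """Mimics the default behaviour: nobody is asked for credentials."""

    def authenticate(self, username, password=None):
        return True


class PasswordSketch:
    """Stores a salted hash per user and never the password itself."""

    def __init__(self):
        self._users = {}

    @staticmethod
    def _hash(password: str, salt: bytes) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                   salt, 100_000)

    def create_user(self, username: str, password: str) -> None:
        salt = os.urandom(16)
        self._users[username] = (salt, self._hash(password, salt))

    def authenticate(self, username: str, password: str) -> bool:
        record = self._users.get(username)
        if record is None:
            return False
        salt, expected = record
        # Constant-time comparison avoids leaking how many bytes matched.
        return hmac.compare_digest(expected, self._hash(password, salt))
```

Because both classes expose the same `authenticate` method, swapping one for the other mirrors changing a single setting in the configuration file.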
3.2.3  Authorization

Authorization is configured using the authorizer setting in the cassandra.yaml file. By default it is set to AllowAllAuthorizer, which does not check permissions and grants all permissions to every user. Cassandra provides options for adding security and changing it according to use. It is flexible enough to provide the level of security required by the industry and its administrators (Ellis, 2009).
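The contrast between the permissive default and a permission-checking authorizer can be sketched as follows. The class names, the resource strings and the method names are hypothetical; only the idea of granting and revoking named permissions per user is taken from the text.

```python
class AllowAllAuthorizerSketch:
    """Mimics the default: every user may do everything."""

    def is_allowed(self, user, permission, resource):
        return True


class GrantAuthorizerSketch:
    """Allows an operation only if a matching grant exists."""

    def __init__(self):
        self._grants = set()

    def grant(self, user, permission, resource):
        self._grants.add((user, permission, resource))

    def revoke(self, user, permission, resource):
        self._grants.discard((user, permission, resource))

    def is_allowed(self, user, permission, resource):
        return (user, permission, resource) in self._grants
```

With the second class, an administrator could give one user read access to a single table while everything else stays denied by default.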
