HBase Interview Questions

• HBase is a column-oriented database management system that runs on top of HDFS (Hadoop Distributed File System). HBase is not a relational data store, and it does not support a structured query language like SQL.
• In HBase, a master node regulates the cluster, and region servers store portions of the tables and perform the work on the data.
• High capacity storage system
• Distributed design to cater large tables
• Column-Oriented Stores
• Horizontally Scalable
• High performance & Availability
• The base goal of HBase is to host tables with billions of rows, millions of columns, and thousands of versions
• Unlike HDFS (Hadoop Distributed File System), it supports random, real-time CRUD operations
• ZooKeeper: It does the coordination work between the client and the HBase Master
• HBase Master: The HBase Master monitors the Region Servers
• RegionServer: The RegionServer monitors the Regions
• Region: It contains an in-memory data store (MemStore) and HFiles
• Catalog Tables: The catalog tables consist of ROOT and META
• HBase consists of a set of tables
• Each table contains rows and columns, like a traditional database
• Each table must contain an element defined as a primary key
• An HBase column denotes an attribute of an object
Operational commands in HBase are of five types:
• Get
• Put
• Delete
• Scan
• Increment
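The five operation types above can be illustrated with a toy in-memory store. This is a hypothetical sketch (the class and method names are made up for illustration), not the HBase client API:

```python
# Toy key-value store illustrating HBase's five operation types
# (get, put, delete, scan, increment). Hypothetical sketch only.

class ToyStore:
    def __init__(self):
        self.rows = {}  # rowkey -> {column: value}

    def put(self, row, col, value):
        self.rows.setdefault(row, {})[col] = value

    def get(self, row):
        return self.rows.get(row, {})

    def delete(self, row):
        self.rows.pop(row, None)

    def scan(self, start, stop):
        # Rows come back in sorted rowkey order, as in HBase.
        return {r: c for r, c in sorted(self.rows.items()) if start <= r < stop}

    def increment(self, row, col, amount=1):
        cur = self.rows.setdefault(row, {}).get(col, 0)
        self.rows[row][col] = cur + amount
        return self.rows[row][col]

store = ToyStore()
store.put("row1", "cf:name", "alice")
store.increment("row1", "cf:visits")
print(store.get("row1"))  # {'cf:name': 'alice', 'cf:visits': 1}
```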
• The WAL (Write Ahead Log) is similar to the MySQL binlog; it records all the changes that occur in the data. It is a standard Hadoop sequence file, and it stores HLogKeys. These keys consist of a sequential number as well as the actual data, and are used to replay not-yet-persisted data after a server crash. So, in case of server failure, the WAL works as a lifeline and recovers the lost data.
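The replay behavior described above can be sketched as a toy simulation (hypothetical model, not HBase internals):

```python
# Minimal WAL-based recovery sketch: every edit is appended to the log
# before it touches the in-memory store; after a "crash", edits newer
# than the last flush are replayed from the log.

wal = []            # durable, append-only log of (seq, row, value)
memstore = {}       # volatile in-memory store
flushed_seq = 0     # highest sequence number persisted to disk

def write(seq, row, value):
    wal.append((seq, row, value))   # 1. record intent in the WAL
    memstore[row] = value           # 2. apply to the MemStore

write(1, "r1", "a")
write(2, "r2", "b")
flushed_seq = 1                     # pretend r1 was flushed to an HFile

memstore.clear()                    # simulate a server crash

# Recovery: replay only edits newer than the last flush.
for seq, row, value in wal:
    if seq > flushed_seq:
        memstore[row] = value

print(memstore)  # {'r2': 'b'} -- the unflushed edit is recovered
```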
• Data size is huge: when you have millions of records to operate on
• Complete redesign: when you move from an RDBMS to HBase, consider it a complete re-design rather than just a change of ports
• SQL-less commands: make sure you can live without RDBMS features such as transactions, inner joins, typed columns, etc.
• Infrastructure investment: you need a large enough cluster for HBase to be really useful
• Column families comprise the basic unit of physical storage in HBase, to which features like compression are applied.
• The row key is defined by the application. As the combined key is prefixed by the rowkey, it enables the application to define the desired sort order. It also allows logical grouping of cells and ensures that all cells with the same rowkey are co-located on the same server.
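Because rows are stored sorted lexicographically by rowkey bytes, the application controls physical sort order through key design. A small sketch (the key format here is a made-up example):

```python
# HBase sorts rows by raw rowkey bytes. Key design therefore decides
# the physical order; zero-padding numeric parts matters.

keys = [b"user1|0000000200", b"user1|0000000050", b"user2|0000000100"]
print(sorted(keys))  # user1|...050, user1|...200, user2|...100

# Without padding, byte order surprises you: b"10" sorts before b"9".
print(sorted([b"9", b"10"]))  # [b'10', b'9']
```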
When you delete a cell in HBase, the data is not actually deleted; instead, a tombstone marker is set, making the deleted cell invisible. HBase deletes are actually removed during compactions.
Three types of tombstone markers are there:
• Version delete marker: marks a single version of a column for deletion
• Column delete marker: marks all versions of a column for deletion
• Family delete marker: marks all columns of a column family for deletion
• In HBase, whatever you write is persisted from RAM to disk, and these disk writes are immutable barring compaction. During the deletion process, major compactions remove delete markers while minor compactions don't. A normal delete results in a delete tombstone marker; the tombstone and the deleted data it represents are removed during compaction.
• Also, if you delete data and then add more data with an earlier timestamp than the tombstone's timestamp, further Gets may be masked by the delete/tombstone marker, and hence you will not receive the inserted value until after the major compaction.
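The masking behavior above can be simulated with a toy cell list (hypothetical model; `None` stands in for a tombstone):

```python
# Toy simulation of a delete tombstone masking a put that carries an
# earlier timestamp than the tombstone.

cells = []  # list of (timestamp, value); value None marks a tombstone

def put(ts, value): cells.append((ts, value))
def delete(ts): cells.append((ts, None))

def get_latest():
    tombstone_ts = max((ts for ts, v in cells if v is None), default=-1)
    live = [(ts, v) for ts, v in cells if v is not None and ts > tombstone_ts]
    return max(live)[1] if live else None

put(5, "v1")
assert get_latest() == "v1"
delete(10)                     # tombstone at ts=10
put(7, "v2")                   # inserted later, but older timestamp
assert get_latest() is None    # masked by the ts=10 tombstone
```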
When you alter the block size of a column family, the new data occupies the new block size while the old data remains within the old block size. During compaction, old data will take on the new block size. New files, as they are flushed, have the new block size, whereas existing data will continue to be read correctly. After the next major compaction, all data should have been converted to the new block size.
HBase:
• It is schema-less
• It is a column-oriented data store
• It is used to store de-normalized data
• It contains sparsely populated tables
• Automated partitioning is done in HBase
RDBMS:
• It is a schema-based database
• It is a row-oriented data store
• It is used to store normalized data
• It contains thin tables
• There is no such provision or built-in support for partitioning
• 2006: BigTable paper published by Google.
• 2006 (end of year): HBase development starts.
• 2008: HBase becomes Hadoop sub-project.
• 2010: HBase becomes Apache top-level project.
• Apache HBase is a sub-project of Apache Hadoop. It is a NoSQL database (the Hadoop database): a distributed, scalable big data store. Use Apache HBase when you need random, real-time read/write access to your Big Data: tables with billions of rows by millions of columns, hosted atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable, and it provides Bigtable-like capabilities on top of Hadoop and HDFS.
• Apache HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a “Data Store” than “Data Base” because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.

• Apache HBase has many features that support both linear and modular scaling. HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows (automatic sharding). HBase also supports a Block Cache and Bloom Filters for high-volume query optimization.
• HDFS doesn't provide fast lookup of records in a file; HBase provides fast lookup of records in large tables.
• In HBase, data is stored as a table (with rows and columns) similar to an RDBMS, but this is not a helpful analogy. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.
• table (an HBase table consists of rows)
• row (a row in HBase contains a row key and one or more columns with values associated with them)
• column (a column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character)
• column family (a set of columns and their values; column families should be considered carefully during schema design)
• column qualifier (a column qualifier is added to a column family to provide the index for a given piece of data)
• cell (a cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value's version)
• timestamp (represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell)
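The "multi-dimensional map" view of the terms above can be sketched as a nested map of row → column family → qualifier → timestamp → value (toy data, made-up row and qualifiers):

```python
# An HBase table modeled as a nested map. The newest timestamp wins
# when a cell is read without an explicit version.

table = {
    "com.example.www": {                       # rowkey
        "anchor": {                            # column family
            "cnnsi.com": {9: "CNN"},           # qualifier -> {ts: value}
        },
        "contents": {
            "html": {6: "<html>new</html>", 5: "<html>old</html>"},
        },
    },
}

def get_cell(row, family, qualifier):
    versions = table[row][family][qualifier]
    latest_ts = max(versions)                  # newest version wins
    return versions[latest_ts]

print(get_cell("com.example.www", "contents", "html"))  # <html>new</html>
```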
• Get (returns attributes for a specified row; Gets are executed via HTable.get)
• Put (either adds new rows to a table, if the key is new, or updates existing rows, if the key already exists; Puts are executed via HTable.put (writeBuffer) or HTable.batch (non-writeBuffer))
• Scan (allows iteration over multiple rows for specified attributes)
• Delete (removes a row from a table; Deletes are executed via HTable.delete)
HBase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compaction.
• Filters in the HBase shell: the Filter Language was introduced in Apache HBase 0.92. It allows you to perform server-side filtering when accessing HBase over Thrift or in the HBase shell.
In total, 18 filters are supported by HBase.
Apache MapReduce is a software framework used to analyze large amounts of data, and is the framework used most often with Apache Hadoop. HBase can be used as a data source (TableInputFormat) and as a data sink (TableOutputFormat or MultiTableOutputFormat) for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to subclass TableMapper and/or TableReducer.
There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has pros and cons.
1) Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is used in a back-end analytic capacity and not serving front-end web pages. The benefit is that the NameNode, Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata. The obvious con is that the cluster is down.
2) Live Cluster Backup
• CopyTable: the CopyTable utility can be used either to copy data from one table to another on the same cluster, or to copy data to a table on another cluster.
• Export: the Export approach dumps the content of a table to HDFS on the same cluster.
Not really. SQL-ish support for HBase via Hive is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests. Using Apache Phoenix, you can retrieve data from HBase with SQL queries.
• There are 5 atomic commands which carry out different operations in HBase: Get, Put, Delete, Scan and Increment.
• A connection to Hbase is established through Hbase Shell which is a Java API.
• The Master server assigns regions to region servers and handles load balancing in the cluster.
• ZooKeeper maintains configuration information, provides distributed synchronization, and also maintains communication between clients and region servers.

• In HBase, a table is disabled to allow it to be modified or its settings changed. When a table is disabled, it cannot be accessed through the scan command.
• hbase> is_disabled 'table name'
• hbase> disable_all 'p.*'
This command disables all the tables starting with the letter p
• HBase does not have an in-built authentication/permission mechanism
• Indexes can be created only on the key column, but in an RDBMS they can be created on any column
• With a single HMaster node, there is a single point of failure
• HBase runs on top of Hadoop, which is a distributed system. Hadoop can scale as and when required by adding more machines on the fly, so HBase is a scale-out process.
• In HBase, the client does not write directly into the HFile. The client first writes to the WAL (Write Ahead Log), which is then picked up by the MemStore. The MemStore flushes the data into permanent storage from time to time.
The catalog tables in HBase maintain the metadata information. They are named -ROOT- and .META. The -ROOT- table stores information about the location of the .META. table, and the .META. table holds information about all regions and their locations.
• As more and more data is written to HBase, many HFiles get created. Compaction is the process of merging these HFiles into one file; after the merged file is created successfully, the old files are discarded.
• There are two types of compaction: major and minor. In a minor compaction, adjacent small HFiles are merged to create a single HFile, without removing deleted cells. The files to be merged are chosen automatically.
• In a major compaction, all the HFiles of a column family are merged and a single HFile is created. Deleted cells are discarded, and it is generally triggered manually.
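The minor/major distinction can be sketched as a toy merge (hypothetical model; each "HFile" is a dict, and `None` stands in for a tombstone):

```python
# Toy minor vs. major compaction: both merge files, but only a major
# compaction drops tombstones and the deleted rows they mark.

def compact(hfiles, major=False):
    merged = {}
    for hfile in hfiles:            # older files first; newer values win
        merged.update(hfile)
    if major:
        merged = {k: v for k, v in merged.items() if v is not None}
    return merged

hfiles = [{"r1": "a", "r2": "b"}, {"r2": None, "r3": "c"}]
print(compact(hfiles))              # {'r1': 'a', 'r2': None, 'r3': 'c'}
print(compact(hfiles, major=True))  # {'r1': 'a', 'r3': 'c'}
```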
• The delete column command deletes all versions of a column, while delete family deletes all columns of a particular family.
• This class is used to store information about a column family such as the number of versions, compression settings, etc. It is used as input when creating a table or adding a column.
• The lower bound of versions indicates the minimum number of versions to be stored in HBase for a column. For example, if the value is set to 3, then the three latest versions will be maintained and the older ones will be removed.
• TTL (Time To Live) is a data-retention technique with which the versions of a cell can be preserved until a specific time period. Once that timestamp is reached, the specific version will be removed.
• HBase does not support table joins. But using a MapReduce job, we can specify join queries to retrieve data from multiple HBase tables.
• Each row in HBase is identified by a unique byte array called the rowkey.
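The TTL check reduces to comparing a cell's age against the column family's TTL. A toy sketch (parameter names are made up):

```python
# A cell version is visible only while its age is below the TTL.

def is_live(cell_ts, ttl_seconds, now):
    return (now - cell_ts) < ttl_seconds

assert is_live(cell_ts=100, ttl_seconds=60, now=150)       # 50s old: kept
assert not is_live(cell_ts=100, ttl_seconds=60, now=200)   # 100s old: gone
```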
The data in Hbase can be accessed in two ways.
• Using the rowkey and table scan for a range of row key values.
• Using mapreduce in a batch manner.
• They are -
(i) Short and Wide
(ii) Tall and Thin
The short and wide table design is considered when there is
• a large number of columns
• a small number of rows
The tall and thin table design is considered when there is
• a small number of columns
• a large number of rows
hbase > alter 'tablename', {NAME => 'ColFamily', VERSIONS => 4}
hbase > alter 'tablename', {NAME => 'colFamily', METHOD => 'delete'} This command deletes the column family from the table.
hbase > disable 'tablename'
hbase > alter 'tablename', {NAME => 'oldcolfamily', NAME => 'newcolfamily'}
hbase > enable 'tablename'
scan 'tablename', {LIMIT => 10}
major_compact 'tablename'
Run a major compaction on the table.
There are two main steps to do a data bulk load in Hbase.
• Generate HBase data files (StoreFiles) from the data source using a custom MapReduce job. The StoreFile is created in HBase's internal format, which can be loaded efficiently.
• Import the prepared files into a running cluster using a tool like completebulkload. Each file gets loaded into one specific region.
• HBase uses a feature called region replication. With this feature, for each region of a table there are multiple replicas that are opened in different RegionServers. The load balancer ensures that the region replicas are not co-hosted in the same region servers.
• The HMaster is the master server responsible for monitoring all RegionServer instances in the cluster, and it is the interface for all metadata changes. In a distributed cluster, it typically runs on the NameNode.
• HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode.
HBase provides two different BlockCache implementations: the default on-heap LruBlockCache and the BucketCache, which is (usually) off-heap.
• The Write Ahead Log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
• With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be sequential. This causes the WAL to be a performance bottleneck.

• When a region is edited, the edits in the WAL file which belong to that region need to be replayed. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting.
• The WAL can be disabled to relieve this performance bottleneck. This is done by calling Mutation.writeToWAL(false) on the HBase client.
• Manual region splitting is done when we have an unexpected hotspot in a table because many clients are querying the same table.
• An HBase Store hosts a MemStore and zero or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.
• The HFile in HBase, which stores the actual data (not metadata), is designed after the SSTable file of Bigtable.
• Tables in HBase are initially created with one region by default. Then for bulk imports, all clients will write to the same region until it is large enough to split and become distributed across the cluster. So, empty regions are created to make this process faster.
• Hotspotting is a situation when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster. This traffic may represent reads, writes, or other operations. It overwhelms the single machine responsible for hosting the affected region, causing performance degradation and potentially leading to region unavailability.
• Hotspotting can be avoided or minimized by distributing the rowkeys across multiple regions. The techniques to do this are salting and hashing.
• In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it is accompanied by its row, column name, and timestamp. If the rows and column names are large, especially compared to the size of the cell value, then the indices kept in HBase store files (HFiles) to facilitate random access may end up occupying larger chunks of the RAM allotted to HBase than the data itself, because the cell coordinates are so large.
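Salting can be sketched as prefixing each rowkey with a deterministic bucket derived from a hash. This is a hypothetical example; the bucket count and key format are made up:

```python
# Salting a sequential rowkey to spread writes across regions.
# The salt is deterministic, so reads can recompute the prefix.

import hashlib

NUM_BUCKETS = 4

def salted_key(rowkey: str) -> str:
    digest = hashlib.md5(rowkey.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket}-{rowkey}"

# Sequential timestamps would all land on one region; salted, their
# prefixes spread over buckets 0..3.
keys = [f"2024-01-01-event-{i}" for i in range(8)]
print([salted_key(k) for k in keys])
```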
• Rowkeys are scoped to ColumnFamilies. The same rowkey could exist in each ColumnFamily that exists in a table without collision.
• The hbase:meta table stores details of the regions in the system in the following format:
info:regioninfo (serialized HRegionInfo instance for this region)
info:server (server:port of the RegionServer containing this region)
info:serverstartcode (start-time of the RegionServer process containing this region)
• A Namespace is a logical grouping of tables. It is similar to a database object in a Relational database system.
• The complete list of columns in a column family can be obtained only by querying all the rows for that column family.
• The records fetched from HBase are always sorted in the order of rowkey -> column family -> column qualifier -> timestamp.
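That composite ordering can be sketched by sorting toy cells on the tuple (rowkey, family, qualifier, timestamp), with the timestamp negated so newer versions come first (hypothetical data):

```python
# Cells sorted by rowkey, then family, then qualifier, then
# timestamp descending -- the order HBase returns them in.

cells = [
    ("row2", "cf1", "qa", 5),
    ("row1", "cf2", "qa", 5),
    ("row1", "cf1", "qb", 5),
    ("row1", "cf1", "qa", 3),
    ("row1", "cf1", "qa", 9),
]

ordered = sorted(cells, key=lambda c: (c[0], c[1], c[2], -c[3]))
print(ordered[0])  # ('row1', 'cf1', 'qa', 9) -- newest version first
```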
A logical division of data represented by a key is called a column family. Column families effectively form dynamically based on the data, and hold multiple columns of related data. All column members of a column family have the same prefix. For example, Vehicle is a column family, and Maruthi, Tata, and Hero are sub-columns (qualifiers) of Vehicle.

Eg: hbase> put 'cars', 'price', 'Vehicle:Maruthi', '1,00,000' // The syntax is: table, row, column, value
put 'cars', 'price', 'Vehicle:Tata', '2,00,000'
put 'cars', 'price', 'Vehicle:Hero', '3,00,000'
Here, cars is the table, price is the row, Vehicle is the column family, and 1,00,000 is the value.
• HBase is suitable for low-latency requests, while MapReduce is high-latency batch processing. HDFS does not support updates, but HBase does. Hadoop stores data as flat files, whereas HBase indexes the data by rowkey.
HBase has the Put and Result interfaces, which convert values to bytes and store them in a byte array. So HBase can support any data type, like strings, numbers, images, or anything that can be rendered as bytes. Typecasting is always possible.

• HBase provides two different block caches: on-heap and off-heap, also called LruBlockCache (the default) and BucketCache. The on-heap cache is implemented in the Java heap, whereas the BucketCache is usually kept off-heap or in a file-backed cache.
create 'table', 'columnFamily'
put 'table', 'row', 'columnFamily', 'value'
get 'table', 'row', 'columnFamily'
scan 'table'
list 'tablename'
disable 'table'
drop 'table'
describe 'table'
• MemStore is a temporary repository in HBase, which holds in-memory modifications to the Store. Once it reaches its maximum size (64 MB), it flushes the data into an HFile on HDFS.
• DFSClient handles all remote server interactions; communication with the NameNode, DataNodes, or JobTracker/YARN requires the DFSClient. HBase persists data in HDFS via the DFS client.
• HBase distributes data dynamically across the system when it receives a huge amount of data; this feature is called auto-sharding.
• ulimit is an upper bound on resources for a process. nproc limits the maximum number of processes available for a particular application.
• Bloom filters filter out blocks that you don't need, which can save disk I/O and improve read latency.
• Memory utilization and caching structures are very important in HBase. To achieve this goal, HBase maintains two cache structures: MemStore and BlockCache. MemStore is a temporary repository and in-memory write buffer. BlockCache keeps data blocks in memory after a read.
• Use the following Hive command to get a detailed description of a Hive table: hive> describe extended <tablename>;
• A block is the smallest single unit of data. There are four varieties: Data, Meta, Index, and Bloom. Data blocks store user data. Index and Bloom blocks serve to speed up the read path: an Index block provides an index of the particular Data blocks, and a Bloom block contains a bloom filter that filters the data and returns the desired data quickly. Meta blocks store information about the HFile.
• An HMaster serves one or more HRegionServers.
• Each HRegionServer has one HLog and serves one or more Regions.
• Each Region has multiple Stores, one per column family.
• Each Store has one MemStore and multiple StoreFiles.
• Each StoreFile contains one HFile.
• The default HFile block size is 64 KB.
First, the client writes the data to the HRegionServer. The data is first stored in the HLog file (write ahead log), then written to the MemStore. The MemStore temporarily holds the data; when the MemStore is full, it flushes the data to an HFile. The data is ordered in both the MemStore and the HFile. The MemStore is the temporary repository in HBase, which persists the data on HDFS via the DFS client.
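The write path above can be sketched as a toy simulation (hypothetical model; the flush threshold here counts edits rather than bytes):

```python
# Toy write path: WAL append -> MemStore -> flush to a sorted "HFile"
# when the MemStore crosses a size threshold.

FLUSH_THRESHOLD = 3   # flush after 3 edits (stand-in for a byte limit)

wal, memstore, hfiles = [], {}, []

def write(row, value):
    wal.append((row, value))      # durability first
    memstore[row] = value
    if len(memstore) >= FLUSH_THRESHOLD:
        # Data is written out sorted by rowkey, like an HFile.
        hfiles.append(dict(sorted(memstore.items())))
        memstore.clear()

for i in range(4):
    write(f"row{i}", i)

print(len(hfiles), memstore)  # 1 {'row3': 3}
```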
• use help followed by command, for example, help 'scan'
• While schema-level updates/alterations are being done, it is not possible to run CRUD operations. To perform schema-level updates, you must first disable the table; this is mandatory.
• If you are going to open a connection with the help of the Java API, the following code provides the connection:
Configuration myConf = HBaseConfiguration.create();
HTableInterface usersTable = new HTable(myConf, "users");
• There's a very big difference between the storage of relational/row-oriented databases and column-oriented databases. For example, if I have a table of users and I need to store friendships between these users, in a relational database my design is something like:
Table: users (pkey = userid); Table: friendships (userid, friendid, …), which contains one (or maybe two, depending on how it's implemented) rows for each friendship.
• In order to look up a given user's friends: SELECT * FROM friendships WHERE userid = myid;
• The cost of this relational query continues to increase as a user adds more friends. You also begin to hit practical limits. If I have millions of users, each with many thousands of potential friends, the size of these indexes grows exponentially and things get nasty quickly. Rather than friendships, imagine I'm storing activity logs of actions taken by users.
• In a column-oriented database these things scale continuously with minimal difference between 10 users and 10,000,000 users, 10 friendships and 10,000 friendships.
• Rather than a friendships table, you could just have a friendships column family in the users table. Each column in that family would contain the ID of a friend. The value could store anything else you would have stored in the friendships table in the relational model. As column families are stored together/sequentially on a per-row basis, reading a user with 1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is just in the shipping of this information across the network which is unavoidable. In this system a user could have 10,000,000 friends. In a relational database the size of the friendship table would grow massively and the indexes would be out of control.
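The wide-row design described above can be sketched as a nested map (hypothetical data; user IDs and the date-as-value convention are made up):

```python
# Wide-row design: friendships as a column family on the users table,
# one column qualifier per friend. Reading all friends is one row read.

users = {
    "user42": {
        "info": {"name": "Alice"},
        "friends": {"user7": "2019-04-01", "user9": "2020-11-12"},
    },
}

print(sorted(users["user42"]["friends"]))  # ['user7', 'user9']
```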
By designing two tables:
Student: student id | student data (name, address, …) | courses (use course ids as column qualifiers here)
Course: course id | course data (name, syllabus, …) | students (use student ids as column qualifiers here)
A rough rule of thumb, with little empirical validation, is to keep the data in HDFS and store pointers to the data in HBase if you expect the cell size to be consistently above 10 MB. If you do expect large cell values and you still plan to use HBase for the storage of cell contents, you’ll want to increase the block size and the maximum region size for the table to keep the index size reasonable and the split frequency acceptable.
• Because of the way HFile works: for efficiency, column values are put on disk with the length of the value written first and then the bytes of the actual value written second. To navigate through these values in reverse order, these length values would need to be stored twice (at the end as well) or in a side file. A robust secondary index implementation is the likely solution here to ensure the primary use case remains fast.
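The length-prefixed layout mentioned above can be sketched in a few lines; it shows why a reader can only skip forward through the values (toy encoding, not the real HFile format):

```python
# Length-prefixed values: 4-byte big-endian length, then the bytes.
# A reader can jump forward (read length, skip), but cannot walk
# backwards without lengths also stored at the end or in a side file.

import struct

def encode(values):
    out = b""
    for v in values:
        out += struct.pack(">I", len(v)) + v
    return out

def decode_forward(buf):
    vals, i = [], 0
    while i < len(buf):
        (n,) = struct.unpack_from(">I", buf, i)
        vals.append(buf[i + 4 : i + 4 + n])
        i += 4 + n
    return vals

buf = encode([b"aa", b"bbbb", b"c"])
print(decode_forward(buf))  # [b'aa', b'bbbb', b'c']
```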
When we change the block size of the column family, the new data takes the new block size while the old data stays within the old block size. When compaction occurs, old data will take on the new block size: "New files, as they are flushed, will have the new block size, whereas existing data will continue to be read correctly. After the next major compaction, all data should be converted to the new block size."

• It would be better when your site needs to scale so massively that the best RDBMS, running on the best hardware you can afford and optimized as much as possible, simply can't keep up with the load. How much better depends on the specific use case (lots of update activity combined with lots of joins is very hard on "traditional" RDBMSs); it could well be a factor of 1000 in extreme cases.
• Hive doesn't support record-level operations, but HBase supports record-level operations.
• HBase has many features that support both linear and modular scaling. HBase clusters expand by adding RegionServers hosted on commodity-class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and processing capacity.
• An RDBMS can scale well, but only up to a point (specifically, the size of a single database server), and for the best performance it requires specialized hardware and storage devices.
• S3 stands for Simple Storage Service, and it is one of the file systems that can be used by HBase.
get() method is used to read the data from the table.
• There are two run modes of Hbase i.e. standalone and distributed.
• It is the default mode of HBase. In standalone mode, HBase does not use HDFS (it uses the local filesystem instead), and it runs all HBase daemons and a local ZooKeeper in the same JVM process.
• It is useful to modify, or extend, the behavior of a filter to gain additional control over the returned data.
• YCSB stands for Yahoo! Cloud Serving Benchmark.
• It can be used to run comparable workloads against different storage systems.
• HBase supports those operating systems which support Java, like Windows and Linux.
• The most common file system used with HBase is HDFS, i.e. the Hadoop Distributed File System.
• A pseudo-distributed mode is simply a distributed mode that is run on a single host.
• It is a file which lists the known region server names.
• The version command is used to show the version of HBase.
Syntax: hbase> version

• This command is used to list the hbase surgery tools.
• It is used to shut down the cluster.
• It is used to disable, recreate and drop the specified tables.
• $ ./bin/hbase shell command is used to run the hbase shell.
• The whoami command is used to show the current HBase user.
• To delete a table, first disable it, then drop it.
• Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed, and is written in ANSI C.
• InputFormat splits the input data, and then it returns a RecordReader instance that defines the classes of the key and value objects and provides a next() method that is used to iterate over each input record.
• HBase comes with a tool called hbck which is implemented by the HBaseFsck class. It provides various command-line switches that influence its behaviour.
• REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources. It also provides support for different message formats, offering many choices for a client application to communicate with the server.
• Apache Thrift is written in C++, but provides schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.
• The fundamental key structures of Hbase are row key and column key.
• The Java Management Extensions technology is the standard for Java applications to export their status.
• Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds.
• The zookeeper is used to maintain the configuration information and communication between region servers and clients. It also provides distributed synchronization.
• HBase shell is a java API by which we communicate with Hbase.
• The exists command is used to check whether the specified table exists or not.
In HBase 0.96, the project moved to a modular structure. Adjust your project's dependencies to rely upon the hbase-client module or another module as appropriate, rather than a single JAR. You can model your Maven dependency after one of the following, depending on your targeted version of HBase. See Section 3.5, "Upgrading from 0.94.x to 0.96.x" or Section 3.3, "Upgrading from 0.96.x to 0.98.x" for more information.
Maven Dependency for HBase 0.98
Maven Dependency for HBase 0.96
Maven Dependency for HBase 0.94
• Always start with the master log (TODO: Which lines?). Normally it's just printing the same lines over and over again. If not, then there's an issue. Google or search-hadoop.com should return some hits for the exceptions you're seeing.
• An error rarely comes alone in Apache HBase; usually when something gets screwed up, what follows may be hundreds of exceptions and stack traces coming from all over the place. The best way to approach this type of problem is to walk the log up to where it all began. For example, one trick with RegionServers is that they will print some metrics when aborting, so grepping for Dump should get you to around the start of the problem.
• RegionServer suicides are 'normal', as this is what they do when something goes wrong. For example, if ulimit and max transfer threads (the two most important initial settings, see [ulimit] and dfs.datanode.max.transfer.threads) aren't changed, it will at some point make it impossible for DataNodes to create new threads, which from the HBase point of view looks as if HDFS were gone. Think about what would happen if your MySQL database was suddenly unable to access files on your local file system; well, it's the same with HBase and HDFS. Another very common reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout.