Home‎ > ‎6 pm class pages‎ > ‎

Jagdeep Deol

 Data warehouse/ data mart

 Data warehouse: Wikipedia page

YouTube Video

  Definition and History:

A data warehouse is a database used for reporting and analysis. One can think of a data warehouse as a place where a company stores all of its valuable data. A data warehouse can hold a company’s information about its customers, products, services, and employees. Prior to the 1980s, most companies had hard copies of raw data in physical files. Data usually was store in different places, rather than at one central location, where a company could access all of the necessary data that it needed to make strategic decisions.  In the late 1980s, the concept of data warehouse start to take shape when employees from IBM named Barry Devlin and Paul Murphy architected one of the first data warehouses (http://en.wikipedia.org/wiki/Data_warehouse). The data that is the stored in a data warehouse comes from operational systems. The functions/operations of a data warehouse can be classified into three layers, which are staging, integrating, and access. In the staging part, data is stored for use by developers.  The integration function is used to combine and organize the data. In the access layer, one can get the data that he/she needs from the warehouse.  For example, an operational system, such as marketing, can store information about different market segments in a data vault, and then that data can be stored into subsets of data called data marts. The data mart of a data warehouse is used to get data out of the data warehouse. Data of each department or business unit can be stored into separate data marts. Data marts can be independent or dependent of other data marts within a single organization (http://en.wikipedia.org/wiki/Data_mart).


  Image: http://upload.wikimedia.org/wikipedia/commons/4/46/Data_warehouse_overview.JPG

Benefits and functions:

A data warehouse mainly focuses on the storage of data. In the data warehouse, data is cleaned, transformed, categorized, and made available for the user. Data from a data warehouse is used by many businesses for data mining, online analytical processing, market research, and support decisions.  Some of the important parts of a data warehouse are retrieving and analyzing data, extracting, transforming and load data, and managing the data dictionary. Some of the benefits of a data warehouse include maintaining data history, integrating data from multiple areas which allows the data to be at a central place, improved data quality, presenting a company’s information consistently, restructuring data so that it makes sense to the user, provides outstanding search performance, and added value to operational businesses applications (http://en.wikipedia.org/wiki/Data_warehouse). 

Two common methods of storing data:

There are two approaches that are followed more than others for storing data which are the dimensional and normalization approach. The dimensional approach was developed by Ralph Kimball. The approach breaks transaction data into two categories, which are facts and dimensions. Facts usually provide numeric data, and dimensions are reference information that gives contexts to numeric data. For example, in a recording of a sales transaction, factual data would include the number of products ordered and the price that was paid for the products (http://en.wikipedia.org/wiki/Data_warehouse). The dimensions of the sales transaction would include data such as the order data, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order (http://en.wikipedia.org/wiki/Data_warehouse). Some of the advantages of using the dimensional approach to store data are that it is easier for the individuals to understand and use, and data can be quickly retrieve from the data warehouse.  The dimensional approach is the preferred storing method by many businesses because it is simpler to understand since the data is divided into measurements and contexts. Some disadvantages to the dimensional approach are that uploading data from different operational system can be a complicated process, and it can be hard to change a data warehouse structure if a business changes its method of doing business.

 The normalization approach stores data using a set of database normalization rules. The data in the normalization approach are divided into several different entities, which create several tables in a relational database (http://en.wikipedia.org/wiki/Data_warehouse). Some of the advantages of the approach is that it is easy to put data into the database, and a disadvantage of the approach is that the large number of tables make it hard for the user to combine data from various sources into valuable information, and it can be hard for one to access the information in the database without having a solid understanding of sources of data.

 Amazon's data warehouse

 Amazon’s data warehouse is composed of twenty eight Hewlett Packard servers and four CPUs per node, and it runs on Oracle’s 9i database software (http://money.howstuffworks.com/amazon1.htm). Amazon’s data warehouse is divided into three different areas, which are query, historical data, and ETL. The three different areas of Amazon’s data warehouse can be viewed as data marts within the data warehouse. In 2005, Amazon’s query servers had fifteen terabytes of data, fourteen terabytes of historical data, and five terabytes of ETL data (http://money.howstuffworks.com/amazon1.htm). During the holiday season in 2003, Amazon had twenty million inventory updates and one million shipments in a single day. All of the raw data that Amazon collects, such as a customer’s credit card number and the type of product that he/she bought, goes into the company’s data warehouse to potentially be used in the future to study the customer’s spend patterns. When the company needs to classify data and/or make predictions about future sales, it using a statistical technique called data mining.  Currently, Amazon’s data warehouse runs on the Netezza Performance Server (NPS) system, which can hold more than twenty five terabytes of clickstream and other transactional data. Some of the benefits of the NPS system over Amazon’s previous database system, which was operated by Oracle’s click stream database, are reduced analysis time which allows more routine examinations, reduced data processing time, and greater performance and speed. In addition to these benefits, the company’s new data warehouse system has also reduced or completely eliminated some of administrative and maintenance duties that are required to maintain a data warehouse (http://money.howstuffworks.com/amazon1.htm). One of Amazon’s data warehouse infrastructure managers said that only one database administrator is needed to maintain the NPS system, as oppose to four database administrators that are required to maintain Amazon’s other data warehouse systems.