My research at Stanford University: Information Integration

Read more

Implementing and running large scale Six Sigma programs.

IT Service Management and IT Outsourcing.


Read more

Information Integration

The problem of information integration has received considerable attention due to the vast number of structured information sources available online. The goal of information integration systems is to provide a uniform query interface to multiple information sources, thereby freeing the user from having to locate the relevant sources, query each one in isolation, and combine manually the information from the different sources.

Information Integration

General architecture of an information integration system. The user interface and the information sources can be modeled for the purpose of query planning by sets of relations, called world relations and source relations respectively. Source descriptions relate source and world relations.

Information integration systems are based on the following general architecture. The user interacts with a uniform interface in the form of a set of global relation names that are used in formulating queries. These relations are called world relations. The actual data is stored in external sources, called the source relations. In order for the system to be able to answer queries, we must specify a mapping between the world relations and the source relations. A  common method to specify these mappings is to describe each source relation as the result of a conjunctive query (i.e., a single Horn rule) over the world relations. For example, an information source storing nonstop flights offered by United Airlines would be described as follows:

CREATE VIEW Flights_by_United
SELECT number, from, to
FROM Nonstop
WHERE airline = 'UA

The relation Nonstop is a world relation and can be used in formulating queries, and Flights_by_Unitedis a source relation.

Given a query from the user, formulated in terms of the world relations, the system must translate it to a query that mentions only the source relations, because only these relations are actually available. That is, the system needs to find a query expression, that mentions only the source relations, and is equivalent to the original query. The new query is called a query plan. The problem of finding a query plan is the same as the problem of answering queries using views. In this context, the views are the relations in the sources. The problem of answering queries using views has also been investigated in the database literature because of its importance for query optimization and data warehousing.

Most previous work has considered the problem of finding query plans where the query plan is required to be  equivalent to the original query.  In practice,  the collection of available information sources may not contain all the information needed to answer a query, and therefore, we need to resort to maximally-contained query plans. A maximally-contained plan provides all the answers that are possible to obtain from the sources, but the expression describing the plan may not be equivalent to the original query. For example, if we only have the Flights_by_United source available, and our query asks for all flights departing from San Francisco International Airport, then the following is a maximally-contained query plan:

SELECT 'UA', number, to
FROM Flights_by_United
WHERE from = 'SFO'

Anwering queries using views

A view is the result of evaluating a query. The query that generates the view is called view definition. One application of views in current database systems is security. For example, a student is only allowed to see his or her own grades in the university database system, but not the grades of other students. This limited access to the database is implemented by defining a view for every student
that contains exactly the information that the student is allowed to see. Students then are only allowed to query their view, but not the underlying database. We will use views in order to describe information available from information sources. Data stored by information sources can be seen as views over a global schema.

Query planning in information integration systems

Query planning in information integration systems is the task of transforming a user query, represented in the user's interface language and vocabulary, into queries that can be executed by the information sources. Every information source might require a different query language and might use different vocabularies. The resulting answers of the information sources need to be translated and combined before the final answer can be reported to the user.

The examples in this section use a global schema with the relation Nonstop(airline,number,from,to). The intended meaning of the tuple <UA,2021,SFO,LAX> in this relation, for example, is that United Airlines (UA) flight 2021is a nonstop flight from San Francisco (SFO) to Los Angeles (LAX). Consider the following two views:

CREATE VIEW Flights_by_United
SELECT from, to
FROM Nonstop
WHERE airline = 'UA'

CREATE VIEW Flights_from_SFO
SELECT airline, number, to
FROM Nonstop
WHERE from = 'SFO'

View Flights_by_United stores the nonstop flights operated by United Airlines, and view Flights_from_SFO stores the nonstop flights out of San Francisco. The first view might represent the data that is available from a United Airlines database, whereas the second view might be stored in a database at San Francisco International Airport.


Equivalent query plans

The query planning problem in information integration systems is very closely related to the problem of answering queries using views. User queries are posed in terms of a global schema. Information sources can be seen as views over this global schema. In order to answer a user query from the data available from the information sources, the query must be rewritten so that it only uses the views.

A user might be interested to know which nonstop flights from San Francisco to Los Angeles are offered. The following is the corresponding SQL query:

SELECT airline, number FROM Nonstop WHERE from = 'SFO' AND to = 'LAX'

It is easy to rewrite her query into an equivalent query that uses only the views defined in Example \ref{example-introduction-views}. View Flights_from_SFO stores all nonstop flights out of San Francisco, and therefore also all nonstop flights from San Francisco to Los Angeles. It follows that the query rewrite

SELECT airline, number FROM Flights_from_SFO WHERE to = 'LAX'

answers the user's query. The rewriting requires the view Flights_from_SFO but does not require any relations from the global schema. Therefore, it can be answered by querying the database at San Francisco International Airport. We will refer to queries that can be answered by using the views as query plans.

Maximally-contained query plans

The user in the example was lucky because her query could be answered exactly from the available views. In general, however, the available views might not provide all the information that is needed to answer exactly the user's query. In these cases, we still want to give the user some answer, indeed the "best" answer that can be given using only the available views. The query plan that computes this "best" answer is called maximally-contained query plan.

The user who was interested in flights from San Francisco to Los Angeles might want to continue her flight to Phoenix. Therefore, she might ask for nonstop flights from Los Angeles to Phoenix:

SELECT airline, number
FROM Nonstop
WHERE from = 'LAX' AND to = 'PHX'

Given only the two available sources - the United Airlines flight database and the database of San Francisco International Airport - there is no way to find a query plan that only uses these sources and is equivalent to the original query. Nonetheless, it is possible to extract some information out of the available sources. Indeed, if United Airlines offers nonstop flights from Los Angeles to Phoenix, then this information would be available from the United Airlines flight database. The query plan

SELECT 'UA', number
FROM Flights_by_United
WHERE from = 'LAX' AND to = 'PHX'

requests this information from the United Airlines database.

Functional dependencies

Frequently, relations in databases satisfy functional dependencies. For example, consider the relation


The intended meaning of the tuple <UA 2021, 08/21, Mike, #111> in this relation is that on August 21st United Airline flight 2021 has Mike as pilot and is using the aircraft with identification number #111. Pilots work for one airline only, and airlines don't have joint
ownership of aircraft. Therefore, relation Schedule satisfies the functional dependencies pilot -> airline and aircraft -> airline. If we know that pilot Mike works for United Airlines, for example, then we can be sure that Mike does not work for American Airlines as well.

Using functional dependencies, more information might be extracted from information sources than would be possible otherwise. Algorithms that don't take functional dependencies into account in the query planning process might fail to produce all correct answers, i.e. the generated query plans might not be maximally-contained in the user query.

Assume that there is an additional view available described by the following definition:

CREATE VIEW Work_Schedule
SELECT date, from, to, pilot, aircraft
FROM Nonstop, Schedule
WHERE Nonstop.airline = Schedule.airline AND Nonstop.number = Schedule.number

This view would store, for example, the tuple

<08/21, SFO, LAX, Mike, #111>

expressing that on August 21st pilot Mike flies from San Francisco to Los Angeles on aircraft #111. A user might be interested in all the pilots that work for the same airline as Mike. The corresponding SQL query is:

SELECT S1.pilot
FROM Schedule AS S1, Schedule AS S2
WHERE S1.airline = S2.airline AND S2.pilot = 'Mike'

View Work_Schedule is the only view that stores any information about relation Schedule. Unfortunately, this view doesn't store the airline attribute explicitly. Therefore, without any consideration of functional dependencies, this query couldn't be answered at all. The functional dependencies pilot -> airline and aircraft -> airline though can be used to extract more information from this view. To see that these functional dependencies might really yield more answers, consider the following instance of view Work_Schedule:

date from to pilot aircraft
08/21 SFO LAX Mike #111
09/05 PHX ATL Ann #111
09/12 SLT ORD Ann #222
10/04 SEA BOS John #222

In this example, Mike and Ann fly the same aircraft. Because of the functional dependency aircraft -> airline it follows that Mike and Ann work for the same airline. In general, the query plan

SELECT W1.pilot
FROM Work_Schedule AS W1, Work_Schedule AS W2
WHERE W1.aircraft = W2.aircraft AND S2.pilot = 'Mike'

is contained in the user query if this functional dependency holds, but is not contained in the user query otherwise.

The preceding example showed that functional dependencies might lead to more query plans that are contained in the given user query. Indeed, we will show that there is no maximally-contained query plan in this case if the queries are restricted to use join, projection, selection, and union only. More precisely, for every query plan that is contained in the user query, there is another query plan contained in the user query that might produce some new answers. In order to get a maximally-contained query plan it is necessaryto consider query plans that contain recursion.

Using recursion there is a query plan that is maximally-contained in the query of the example. The maximally-contained recursive query plan is the following:


(SELECT pilot FROM Work_Schedule WHERE pilot = 'Mike')
(SELECT pilot FROM Work_Schedule, Aircraft WHERE Work_Schedule.aircraft = Aircraft.aircraft),

(SELECT aircraft FROM Work_Schedule WHERE pilot = 'Mike')
(SELECT aircraft FROM Work_Schedule, Pilots WHERE Work_Schedule.pilot = Pilots.pilot),

SELECT pilot
FROM Pilots

For the convenience of the reader who might be more used to representing recursive queries in datalog notation instead of SQL, we also give the datalog query that corresponds to the SQL query. Note that in datalog it is conventional to denote relations and constants by lower case names, and variables by upper case names.

q(Pilot) :- pilots(Pilot)

pilots(Pilot) :- work_schedule(Date,From,To,mike,Aircraft)

pilots(Pilot) :- work_schedule(Date,From,To,Pilot,Aircraft), aircraft(Aircraft)

aircraft(Aircraft) :- work_schedule(Date,From,To,mike,Aircraft)

aircraft(Aircraft) :- work_schedule(Date,From,To,Pilot,Aircraft), pilots(Pilot)

The query computes two intermediate relations, Pilots and Aircraft. Relation Pilots stores all the pilots that work for the same airline as Mike, and relation Aircraft stores all the aircraft that are owned by the airline that Mike works for. Obviously, if the tuple <08/21, SFO, LAX, Mike, #111> is in view Work_Schedule, then Mike is one of the pilots in relation Pilots, and aircraft #111 is one of the aircraft in relation Aircraft. More interestingly, if <09/12, ORD, JFK, John, #222> is in view Work_Schedule and aircraft #222 is in relation Aircraft, then John is also one of the pilots in relation Pilots because of the functional dependency aircraft -> airline. Similarly, if <09/23, BOS, SEA, Sue, #333> is in view Work_Schedule and Sue is a pilot in relation Pilots, then aircraft #333 is also one of the aircraft in relation Aircraft because of the functional dependency pilot -> airline.

Breaking News

  • New Unisys Security Products and Solutions provide advanced protection against enterprise threats
  • European Food Safety Authority chooses Unisys consortium for IT services and support
  • Unisys teams with ServiceNow to deliver new integrated service management solutions

Site Directory