Tuesday, March 2, 2010

Organizing the World's information

"My guess is (it will be) about 300 years until computers are as good as, say, your local reference library in doing search," says Google's first employee and director of technology Craig Silverstein. "But we can make slow and steady progress, and maybe one day we'll get there." (Inside the Wide World of Google, CBS News, March 28, 2004). According to CEO Eric Schmidt, people care a lot about information, and the possibilities of the unfolding revolution in technology are greater than many of them even realize: "Imagine the scale of the kinds of questions you could ask that you could not ask before."

The figure is a great compilation of key Google facts by Pingdom. Among the key technical facts not shown there are far-reaching inventions such as the Programmable Search Engine (PSE, see the patent application), which is based on the Bigtable database (see the research publication about it, in PDF format), and other systems that could help Google become more semantic.

Bigtable, Google's distributed storage system for managing structured data, resembles a database and shares many implementation strategies with parallel and main-memory databases. Instead of a full relational data model, it uses a simple data model in which data is indexed by row and column names that can be arbitrary strings. A Bigtable is a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes. Clients often serialize various forms of structured and semi-structured data into these byte strings, and they control the locality of their data through careful choices in their schemas.
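To make the data model concrete, here is a minimal in-memory sketch of a sorted map indexed by row key, column key, and timestamp. It is only a toy: the class name `TinyTable` and its methods are made up for this post and are not Bigtable's API, and nothing here is distributed or persistent. The reversed-domain row keys are an illustrative schema choice that keeps pages from the same site next to each other in sorted order.

```python
from bisect import bisect_left, insort


class TinyTable:
    """Toy stand-in for the map (row, column, timestamp) -> bytes."""

    def __init__(self):
        self._rows = {}         # row key -> {column key -> {timestamp -> bytes}}
        self._sorted_keys = []  # row keys kept in lexicographic order

    def put(self, row, column, timestamp, value):
        # Values are uninterpreted bytes; serialization is the client's job.
        if not isinstance(value, bytes):
            raise TypeError("value must be bytes")
        if row not in self._rows:
            self._rows[row] = {}
            insort(self._sorted_keys, row)
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the value at `timestamp`, or the newest version if omitted."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, start_row, end_row):
        """Yield cells with start_row <= row < end_row, in sorted row order."""
        lo = bisect_left(self._sorted_keys, start_row)
        hi = bisect_left(self._sorted_keys, end_row)
        for row in self._sorted_keys[lo:hi]:
            for column in sorted(self._rows[row]):
                versions = self._rows[row][column]
                for ts in sorted(versions, reverse=True):  # newest first
                    yield row, column, ts, versions[ts]


table = TinyTable()
table.put("com.cnn.www", "contents:", 3, b"<html>v3")
table.put("com.cnn.www", "contents:", 5, b"<html>v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
table.put("org.example", "contents:", 1, b"<html>")

latest = table.get("com.cnn.www", "contents:")
com_rows = [cell[0] for cell in table.scan("com.", "com/")]
```

Because the row keys are sorted, a range scan over the prefix "com." touches only neighboring rows, which is the kind of locality a client can arrange through its choice of keys.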

PSE and other integration technologies may provide a higher level of semantic analysis. These techniques could figure out the meaning of content and "fill in the blanks" when an item of information is ambiguous or missing. The idea is to enrich an information object with additional tags so that queries can take into account its lineage (where something came from) and its likelihood of accuracy (the "correctness" of an information element) when generating a result.
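As a rough illustration of the tagging idea, the sketch below attaches a lineage tag and an accuracy estimate to each piece of information and uses both when answering a lookup. Everything in it is hypothetical: the `Fact` record, the `best_value` helper, the sample sources, and the confidence numbers are invented for this example and are not taken from the patent application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    source: str        # lineage tag: where the value came from
    confidence: float  # estimated likelihood that the value is correct


def best_value(facts, subject, attribute, min_confidence=0.0):
    """Return the most trusted matching fact, or None if nothing qualifies."""
    candidates = [
        f for f in facts
        if f.subject == subject
        and f.attribute == attribute
        and f.confidence >= min_confidence
    ]
    return max(candidates, key=lambda f: f.confidence, default=None)


facts = [
    Fact("Golden Gate Bridge", "opened", "1937", "encyclopedia.example", 0.95),
    Fact("Golden Gate Bridge", "opened", "1936", "forum.example", 0.40),
]

answer = best_value(facts, "Golden Gate Bridge", "opened")
strict = best_value(facts, "Golden Gate Bridge", "opened", min_confidence=0.99)
```

The result carries its own `source`, so a caller can show where the answer came from or refuse to answer when no candidate clears the confidence threshold.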

Another new concept is a probabilistic mediated schema that is created automatically from the data sources. The semantic mappings between the schemas of the data sources are mediated by a set of candidate schemas, each with a probability attached, so that uncertainty is modeled at the core of the system. A deterministic mediated schema derived from the probabilistic ones is then exposed to the user, who can use its terminology to interact with the system.
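One way to picture this is to treat each candidate mediated schema as a grouping of source attributes into concepts, with a probability attached to each grouping. The sketch below is a simplified illustration under assumptions of my own: the attribute names and probabilities are invented, every candidate is assumed to cover every attribute, and the consolidation rule (keep two attributes together only if every candidate does) is one plausible choice rather than a description of the actual system.

```python
def same_concept_probability(p_med_schema, a, b):
    """Total probability of the candidates that put a and b in one cluster."""
    return sum(
        prob
        for clusters, prob in p_med_schema
        if any(a in cluster and b in cluster for cluster in clusters)
    )


def consolidate(p_med_schema):
    """Build one deterministic schema from the candidate schemas.

    Two attributes share a cluster in the result only if they share a
    cluster in every candidate schema.
    """
    candidates = [clusters for clusters, _ in p_med_schema]
    attributes = sorted({a for clusters in candidates for c in clusters for a in c})

    def signature(attr):
        # Position of the cluster holding `attr` in each candidate schema.
        return tuple(
            next(i for i, c in enumerate(clusters) if attr in c)
            for clusters in candidates
        )

    groups = {}
    for attr in attributes:
        groups.setdefault(signature(attr), []).append(attr)
    return list(groups.values())


# Each candidate: (list of attribute clusters, probability).
p_med_schema = [
    ([["name", "fullName"], ["phone", "hPhone"]], 0.7),
    ([["name", "fullName"], ["phone"], ["hPhone"]], 0.3),
]

p_phone = same_concept_probability(p_med_schema, "phone", "hPhone")
exposed = consolidate(p_med_schema)
```

Here the candidates agree that "name" and "fullName" mean the same thing but disagree about "phone" and "hPhone", so the schema exposed to the user keeps the two phone attributes apart while the system still records how likely they are to match.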

The Semantic Web is emerging to help us get the most out of the world's information. Many interesting applications are already here, and some of them have already been acquired by major search players. Bing, for example, is based on semantic technology from Powerset, which Microsoft purchased in 2008. This blog article covers only one of the players organizing the world's information. Stay tuned for more.

