Traditional Media Figures Out that Sci-Tech Publishing Is Threatened

September 8, 2008

But do the sci-tech publishers know what’s happening? Navigate to http://www.msnbc.msn.com/id/26512717/ and read “Era of Scientific Secrecy Ends.” The article, written by Robin Lloyd, provides a roundup of information changes that will roil the world of sci-tech publishing. Ms. Lloyd does not focus on publishing, but her analysis makes clear that “open science” will bring further changes to sci-tech information access. What’s amazing is that this article comes from two companies struggling to keep pace with similar changes. When I read this good article, I thought of the captain and his officers on the Titanic as the ship was sinking. Insight came too late.

Stephen Arnold, September 8, 2008

Intel: LTU Talks Up Next Generation Processors

September 8, 2008

Update: September 8, 2008, 8:12 am Eastern

More about the Intel quad push is at http://www.yourdesktopinnovation.com/

Original Post

Another item about Intel and search. LTU offers an image processing system that law enforcement professionals find useful in certain matters. But LTU’s technology needs processing horsepower. The company had a deal to embed its image classification technology in a consumer video device, but that effort was slow out of the gate. The reason, according to my sources, was performance.

At the Intel Developer Forum, LTU showed its image processing system running on Intel’s zippy i7 processor. I can’t keep the names straight anymore, but this processor features more cores on the die and more cache, plus speed-ups for computationally intensive applications such as image and content processing.

The crowd loved the demonstration, which should make Intel happy. Search vendors need a way to crank up the performance of their systems. Throwing hardware at search bottlenecks may not be the smartest way to solve problems, but it is one that does not require the search vendors to tackle harder problems such as input/output and clunky code in their search systems.

I think Endeca will follow in LTU’s footsteps. Intel is poking around the periphery of search, and the company is going to have to take positive action if it wants to do more than sell chips. My hunch is that smart devices with search and content processing functions on board could be an avenue Intel investigates.

LTU, in case you are not familiar with the company, is French. The company was founded in 1999 by software and engineering wizards. LTU Technologies provides multimedia content control solutions. Its patented technology is used by the French Gendarmerie Nationale and the Italian state police; agencies investigating traffic in cultural goods and stolen objects (the OCBC of the French National Police); and commercial media organizations such as Corbis and Meredith Corporation. You can get more information about the company at http://www.ltutech.com.

Stephen Arnold, September 8, 2008

Oracle Teams with ekiwi

September 8, 2008

ekiwi, based in Provo, Utah, has formed a relationship with Oracle. The company, founded in 2002, focuses on Web-based data extraction. The firm’s Screen-Scraper technology is, the news release asserts, “platform-independent and designed to integrate with virtually any existing information technology system.”

The company describes Screen-Scraper this way here:

It consists of a proxy server that allows the contents of HTTP and HTTPS requests to be viewed, and an engine that can be configured to extract information from Web sites using special patterns and regular expressions. It handles authentication, redirects, and cookies, and contains an embedded scripting engine that allows extracted data to be manipulated, written out to a file, or inserted into a database. It can be used with PHP, .NET, ColdFusion, Java, or any COM-friendly language such as Visual Basic or Active Server Pages.
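The passage describes a classic pattern-and-regular-expression extraction pipeline. As a rough illustration of that general approach, here is a minimal sketch, assuming a page whose markup we already know. The URL and HTML structure are my assumptions for illustration; this is not ekiwi’s code or API:

```python
# Generic regex-based screen scraping in the spirit of the description
# above. The URL and the HTML structure are hypothetical.
import re
import urllib.request

url = "http://example.com/products"  # assumed dynamic, database-driven page
html = urllib.request.urlopen(url).read().decode("utf-8")

# Extract (name, price) pairs from markup such as:
#   <span class="name">Widget</span> ... <span class="price">$9.99</span>
pattern = re.compile(
    r'<span class="name">(.*?)</span>.*?<span class="price">\$([\d.]+)</span>',
    re.DOTALL,
)

for name, price in pattern.findall(html):
    print(f"{name}: ${price}")
```

A production system, as the quoted description notes, also has to handle authentication, redirects, and cookies before any pattern matching can begin.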

Oracle’s revenues are in the $18 to $20 billion range. ekiwi’s revenues may be more modest. Oracle, however, has turned to ekiwi for screen scraping technology to enhance the content acquisition capabilities of Oracle’s flagship enterprise search system, Secure Enterprise Search 10g or SES10g. In May 2008, one of Oracle’s senior executives told me that SES10g was a key player in the enterprise search arena and that SES10g sold because it was secure. Security, I recall being told, was the key differentiator.

This deal suggests that Oracle has to turn to up-and-coming screen scraping vendors to expand the capabilities of SES10g. I’m still puzzling over this deal, but that’s clearly my inability to understand the sophisticated management thinking that propels SES10g to its lofty position among the search and content processing vendors.

The news release makes it clear that ekiwi can access content from the “deep Web”. To me, this buzzword means dynamic, database-driven sites. Google has its own “deep Web” technologies, which may be described in part in its five Programmable Search Engine patents, published by the USPTO as patent applications in February 2007.

ekiwi, which offers a very useful Web log here, is:

…a member of the Oracle PartnerNetwork, has worked with Oracle to develop an adaptor that integrates ekiwi’s Screen Scraper with Oracle Secure Enterprise Search to help significantly expand the amount of enterprise content that can be searched while maintaining existing information access and authorization policies. The Oracle Secure Enterprise Search product provides a secure, easy-to-use enterprise search platform that connects to a broad range of enterprise applications and data sources.

The release continues:

The two technologies have already been coupled in a number of cases that demonstrate their ability to work together. In one instance cell phones from many of the major providers were crawled by Screen-Scraper and indexed by Oracle Secure Enterprise Search. A user shopping for cell phones is then able to search, filter, and browse from a single location the various cell phone models by attributes such as price, form factor, and manufacturer. In yet another case, Screen-Scraper was used to extract forum postings from various photography aficionado web sites. This information was then made available through Oracle Secure Enterprise Search, which made it easy to conduct internal marketing analysis on recently released cameras.

I did some poking around and came up short after a quick look at my files and running a couple of Web searches. Information is located, according to the news story about the deal, here. The URL is http://www.screen-scraper.com/ss4ses/. The link redirected for me to http://www.w3.org/Protocols/. The company’s Web site is at http://www.screen-scraper.com, and it looked like this on September 7, 2008, at 8 pm Eastern:

[Screenshot: the screen-scraper.com splash page]

I am delighted that SES10g can acquire Web-based content in dynamic systems. I remain confused about the functions included with SES10g. My understanding was that SES10g was easily extensible and compatible with Oracle Applications, Fusion, and other Oracle technologies. If this were true, pulling content from database-driven services should be trivial for the firm’s engineering team. I was hoping for an upgrade to SES10g, but that seems not to be in the cards at this time. Scraping Web pages seems to be a higher priority than getting a new release out the door. What’s your understanding of Oracle’s enterprise search strategy? I’m confused. Help me out, please.

Stephen Arnold, September 8, 2008

New Beyond Search White Paper: Coveo G2B for Mobile Email Search

September 8, 2008

The Beyond Search research team prepared a white paper about Coveo’s new G2B for Email product. You can download a copy from us here or from Coveo here. Coveo’s system works across different mobile devices, requires no third-party viewers, delivers low-latency access when searching, showed no rendering issues in our tests, and provided access to contacts and attachments as well as the text of an email. Compared to email search solutions from Google, Microsoft, and Yahoo, Coveo’s new service proved more robust and functional. Beyond Search identified 13 features that set G2B apart. These include a graphical administrative interface, comprehensive usage reports, and real-time indexing of email. The Beyond Search research team (Stephen Arnold, Stuart Schram, Jessica Bratcher, and Anthony Safina) concluded that Coveo established a new benchmark for mobile email search. For more information about Coveo, navigate to www.coveo.com. Pricing information is available from Coveo.

Stephen Arnold, September 5, 2008

Attivio: New Release, Support for 50+ Languages

September 7, 2008

I’m not sure if it’s because Attivio is located less than five miles from Fenway Park, where everyone is, by default, a rabid Sox fan, but I got a preview of a slick new baseball demo the company put together to showcase the capabilities of its trademarked Active Intelligence Engine (AIE).

For the upcoming Enterprise Search Summit West in late September, Attivio created a single index composed of more than 700,000 news articles about baseball, dating from 2001 to 2007. Attivio told me these were fed into the AIE in XML format. Attivio also processed a dozen comma-delimited files containing baseball statistics such as batting, pitching, player salaries, team information, and players’ post-season performances. Here are the results from my search for steroids:

[Screenshot: Attivio AIE results for the query “steroids”]

© Attivio, 2008

Several aspects of this interface struck me as noteworthy. I liked:

  1. The ability to enter a word or phrase, a SQL query, or a combination “free text” item and a SQL query. Combining the ambiguity of natural language with the precision of a structured query language instruction gives me the type of control I want in my analytic work. Laundry lists don’t help me much. Fully programmatic systems like those from SAS and SPSS are too unwieldy for the fast-cycle work that I have to do. (A sketch of this type of combined query appears after this list.)
  2. The point-and-click access to entities, alternative views, and other “facet” functions. Without having to remember how to perform a pivot operation, I can easily view information from structured and unstructured sources with a mouse click. For my work, I often pop between data and information associated with a single individual. The Attivio approach is a time saver, which is important for my work on tight deadlines.
  3. Administrative controls. The Attivio 1.2 release makes it easy for me to turn on certain features when I need them; for example, I can disable the syntax view with a mouse click. When I need to fiddle with my search statement, a click turns the function back on. I can jump to an alerts page to specify what I want to receive automatically and configure other parameters.
  4. Hit highlighting. I want to be able to spot the key fact or passage without tedious scanning.
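
To make item 1 concrete, here is a minimal sketch of how a client might submit a combined free-text and SQL query. The endpoint, port, parameter names, and schema are my assumptions for illustration; Attivio’s actual API may differ:

```python
# Hypothetical combined free-text + SQL query against a search engine's
# HTTP API. Endpoint, parameters, and schema are assumptions, not
# Attivio's documented interface.
import urllib.parse
import urllib.request

free_text = "steroids"  # the ambiguous, natural-language part
sql_part = "SELECT player FROM batting WHERE season = 2005"  # the precise part

params = urllib.parse.urlencode({"q": free_text, "sql": sql_part})
url = f"http://localhost:17000/search?{params}"  # assumed host and port

with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))
```

The appeal is that one query surface spans both the unstructured news articles and the comma-delimited statistics files described above.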


Life before Google: History from Gen X

September 7, 2008

When I am in the UK, I enjoy reading the London papers. The Guardian often runs interesting and quirky stories. My newsreader delivered to me “Life before Google” by Kevin Anderson, who was in college in the 1990s. Ah, Gen X history. I dived right in, and you may want to read the article here. After a chronological rundown of Web search (happily ignoring the pre-Web search systems), Mr. Anderson wrote:

Using the almost 250 year-old theories of British mathematician and Presbyterian minister Thomas Bayes, Page and Brin developed an algorithm to analyse the links to a site, helping to predict what sites were relevant to search terms.

This is a comment that is almost certain to catch the attention of Autonomy, the British vendor that has claimed Bayesian methods as its core technology.

Then Mr. Anderson added:

Google hasn’t solved search. There is still the so-called dark web, or deep web – terabytes of data that aren’t searchable or indexed.

Mr. Anderson, despite his keen Gen X intellect, overlooked Google’s Programmable Search Engine inventions and what Google returns for the query air schedule LGA SFO. The result displayed is:

[Screenshot: Google’s result for the query “air schedule LGA SFO”]

What you are looking at is a “deep Web” search result. Mr. Anderson also overlooked the results for Baltimore condo.

The results displayed when I ran this search on September 6, 2008, at 7:10 pm Eastern were:

[Screenshot: Google’s results for the query “Baltimore condo”]

Yep, another “deep Web” search.

What’s the problem with the Gen X research behind Mr. Anderson’s article? It was shallow. Much of the analysis of Google is superficial, incomplete, and misleading in my opinion. Agree or disagree? Help me learn.

Stephen Arnold, September 7, 2008

Personalized Network Searching

September 7, 2008

On September 4, 2008, the USPTO published Google’s patent application US2008/0215553 for an invention called “personalized network searching”. The inventors are Gregory Badros and Stephen Lawrence. In this short post, I want to provide a glimpse of the inventors’ backgrounds and then briefly comment on the invention. With the availability of Chrome, Google’s browser, “network searching” becomes more important to me. You, of course, may be indifferent to Google’s “inventions”, but I find them useful windows through which to observe Google engineering at work. A patent document does not mean that the invention will be used or that it will work, but patents can provide some information about a firm that keeps its lips zipped.

First, who is Stephen Lawrence? He has a low profile, which is not surprising. The biography available from the Queensland University of Technology provides some information. You can read the biography here. That write up suggests he is a top notch thinker. After getting his PhD, he went to work at the NEC Research Institute in Princeton, New Jersey. He then jumped to Google, where he seems to still work as a Senior Staff Research Scientist. Among the projects on which he has worked at Google is the desktop search application.

Greg Badros is a former InfoSpace engineer. He earned a PhD in computer science and engineering at the University of Washington. A Duke University undergraduate, he graduated magna cum laude in 1995. He signed on at Google in 2003. Among his projects were Gmail, Calendar, and AdSense. He has received two Google Founders’ Awards and two Executive Management Group awards. You can pick up biographical details here.

These two fellows teamed up in 2003 to work on “personalized network searching.” The patent application was published as US2008/0215553 in September 2008.

The abstract for the invention is:

Personalized network searching, in which a search query is received from a user, and a request is received to personalize a search result. Responsive to the search query and the request to personalize the search result, a personalized search result is generated by searching a personalized search object. Responsive to the search query, a general search result is generated by searching the general search object. The personalized search result and the general search result are provided to a client device, an advertisement is selected based at least in part upon the personalized search object, and the advertisement, the personalized search result, and the general search result are displayed.

My reading of this document is that Google uses the user’s bookmarks, search history, annotations, and the query itself to determine what the user seeks. The results may be enhanced with a symbol to add information for the user. Users with similar interests could be woven into a community. Users may explicitly provide Google with bookmarks, but the invention can pull these items and others from the user’s computing device. The patent document provides a number of examples of how this invention might be used, ranging from pushing information to the user to performing collaborative work. One feature is that if a user doesn’t use bookmarks, the system will monitor what the user does and generate bookmarks based on those actions and data available to the system. The claims include personalization of advertising, information, and interface. (A sketch of this flow appears below.)
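
As a thought experiment, here is a minimal sketch of the merge-and-annotate flow the abstract describes: search a personalized object, search the general object, then flag and promote the personalized hits. The data structures and ranking rule are entirely my assumptions; the patent document does not disclose an implementation:

```python
# Sketch of the personalized-search flow in the abstract. All names and
# the ranking rule are assumptions for illustration only.

def personalized_search(query, user_profile, general_index):
    """Return general results, flagging hits drawn from the user's profile."""
    # The "personalized search object": bookmarks, history, annotations.
    personal_urls = set(user_profile["bookmarks"]) | set(user_profile["history"])

    results = []
    for doc in general_index:
        if query not in doc["text"]:
            continue
        hit = dict(doc)
        # Enhance personalized results with a symbol, per the claims.
        hit["marker"] = "*" if doc["url"] in personal_urls else ""
        results.append(hit)

    # Personalized hits float to the top of the combined list.
    results.sort(key=lambda d: d["marker"], reverse=True)
    return results
```

Note that a system built this way needs the user’s bookmarks and history on the server side, which is exactly the membrane-opening issue discussed below.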

For me the key point is that the membrane or boundary between the user’s personal computer and its data and Google is opened. Whether this makes the user’s computer part of the broader Google computing environment or not depends on how you interpret the language of the patent document. You may find reading the 14-page document interesting. I did. A copy is available from the USPTO here. My view is that Chrome makes this type of Google private network connection easier for the GOOG to control and instrument. I can think of some interesting uses of this technology for intelligence and enterprise applications. What are your thoughts?

Stephen Arnold, September 8, 2008

Text Processing: Why Servers Choke

September 6, 2008

Resource Shelf posted a link to a Hewlett-Packard Labs paper. Great find. You can download the HP write up here (verified at 7 pm Eastern on September 5, 2008). The paper argues that an HP innovation can process text at the rate of 100 megabytes per second per processor core. That’s quite fast. The value of the paper for me was that the authors of “Extremely Fast Text Feature Extraction for Classification and Indexing” have done a thorough job of providing data about the performance of certain text processing systems. If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to suggest that Lucene is a very slow horse in a slow race.

Another highlight of George Forman and Evan Kirshenbaum’s write up was this statement:

Multiple disks or a 100 gigabit Ethernet feed from many client computers may certainly increase the input rate, but ultimately (multi-core) processing technology is getting faster faster than I/O bandwidth is getting faster. One potential avenue for future work is to push the general-purpose text feature extraction algorithm closer to the disk hardware. That is, for each file or block read, the disk controller itself could distill the bag-of-words representation and then transfer only this small amount of data to the general-purpose processor. This could enable much higher indexing or classification scanning rates than is currently feasible. Another potential avenue is to investigate varying the hash function to improve classification performance, e.g. to avoid a particularly unfortunate collision between an important, predictive feature and a more frequent word that masks it.

When I read this, two thoughts came to mind:

  1. Search vendors counting on new multi-core CPUs to solve performance problems won’t get the speed-ups needed to make some systems process content more quickly. Bad news for one vendor whose system I just analyzed for a company convinced that performance is a strategic advantage. In short, slow loses.
  2. As more content is processed and short cuts taken, hash collisions can reduce the usefulness of the value-added processing, so a query returns unexpected results. Much of the HP speed-up is a series of short cuts, and short cuts can undermine what matters most to the user: getting the information needed to meet a need. (The sketch after this list illustrates the collision problem.)
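
To illustrate the collision problem in item 2, here is a toy sketch of hashed bag-of-words feature extraction. This is a generic illustration of the technique, not HP’s algorithm; the small table size is deliberate, to make collisions likely:

```python
# Toy hashed bag-of-words extractor showing why collisions matter.
# Generic illustration, not HP's algorithm.
from collections import Counter

NUM_BUCKETS = 2 ** 10  # deliberately small so collisions are easy to see

def hashed_features(text):
    """Map each token to a hash bucket; colliding tokens share one count."""
    counts = Counter()
    for token in text.lower().split():
        counts[hash(token) % NUM_BUCKETS] += 1
    return counts

doc = "a rare predictive term buried among very common words"
print(hashed_features(doc))
# If a rare, predictive token lands in the same bucket as a frequent word,
# the frequent word's count masks the rare token's signal downstream: the
# "unfortunate collision" the authors mention.
```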

I urge you to read this paper. Quite a good piece of work. If you have other thoughts about this paper, please, share them.

Stephen Arnold, September 6, 2008

WordLogic, Codima: Entering the Search War

September 6, 2008

WordLogic (Vancouver, BC) and Codima (Edmonton, AB) have teamed in a joint venture to develop Web search technology. Not much information is available on the tie-up. Mediacaster Magazine has a short announcement of the deal here. WordLogic has carved a path for itself in mobile device interfaces. Codima is a VoIP specialist. More information about the company is here. Mobile search is attracting interest from Google and Yahoo. Coveo, another Canadian outfit, has a mobile email search service that looks very solid. As more information becomes available about the WordLogic-Codima play, I will pass it along.

Stephen Arnold, September 6, 2008

Another Google 180

September 6, 2008

Physorg.com ran a story called “Google Chief Admits to Defensive Component of Browser Launch”. You can read the full story here. The point of the story is that Google needed a browser to protect and attack. For me, the most interesting statement in the story was this quote attributed to Eric Schmidt:

“It is true that we actually, and I in particular, have said for a long time that we should not do a browser because it wasn’t necessary,” he told the business daily. “The thing that changed in the past couple of years … is that people started building powerful applications on top of browsers and the browsers that were out there, in particular in Explorer, were not up to the task of running complex applications.”

Now that Google has hit age 10, it seems able to change its mind like a 10-year-old. How do I know what Google says today will be true tomorrow? Answer: I don’t. Do you?

Stephen Arnold, September 6, 2008
