Guha Still Going Strong: Spam Prevention in the PSE

September 7, 2010

I still think Ramanathan Guha is a pretty sharp Googler. I met a university professor who did not agree. Tough patooties. Here is the most recent patent application from the guru Guha: US20100223250, “Detecting Spam Related and Biased Contexts for Programmable Search Engines.”

A programmable search engine system is programmable by a variety of different entities, such as client devices and vertical content sites to customize search results for users. Context files store instructions for controlling the operations of the programmable search engine. The context files are processed by various context processors, which use the instructions therein to provide various pre-processing, post-processing, and search engine control operations. Spam related and biased contexts and search results are identified using offline and query time processing stages, and the context files from vertical content providers associated with such spam and biased contexts and results are excluded from processing on direct user queries.

What’s the significance? You will have to wait for one of the azurini to explain Guhaisms. I would note these clues:

  • Context
  • Entities
  • Query time processing stages
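
For the curious, here is a toy sketch of what the abstract seems to describe: vet the context files from vertical content providers offline, toss the ones that smell like spam, and let only the survivors touch a live user query. This is my illustration, not Google’s code; every name in it is invented.

```python
# Toy reading of US20100223250, not Google's implementation.
# Context files customize search; spammy or biased ones are excluded
# offline so they never influence a direct user query.

SPAM_TERMS = {"cheap meds", "casino bonus", "work from home riches"}

def looks_spammy(context_file: dict) -> bool:
    """Offline stage: crude heuristic flag for a provider's context file."""
    boosted = " ".join(context_file.get("boost_terms", [])).lower()
    return any(term in boosted for term in SPAM_TERMS)

def trusted_contexts(context_files: list[dict]) -> list[dict]:
    """Keep only the context files that survive the offline spam check."""
    return [cf for cf in context_files if not looks_spammy(cf)]

def run_query(query: str, contexts: list[dict]) -> str:
    """Query-time stage: apply only trusted contexts to the user query."""
    boosts = [t for cf in contexts for t in cf.get("boost_terms", [])]
    return f"{query} (boosted by: {', '.join(boosts) or 'nothing'})"

context_files = [
    {"provider": "health-site.example", "boost_terms": ["oncology", "melanoma"]},
    {"provider": "spam-site.example", "boost_terms": ["cheap meds", "casino bonus"]},
]
print(run_query("skin cancer treatment", trusted_contexts(context_files)))
```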

But what does an addled goose know? Not much.

Stephen E Arnold, September 7, 2010

Freebie


Twitter: New Monetizing Play?

August 14, 2010

Data and text mining boffins like to crunch “big data.” The idea is that the more data one has, the less slop in the wonky “scores” that fancy math slaps on certain “objects.” Individuals think their actions are unique. Not exactly. The more data one has about people, the easier it is to create some conceptual pig pens and push individuals into them. If you don’t know the name and address of the people, no matter. Once a pig pen has enough piggies in it (50 is the lower boundary I like to use), I can push anonymous “users” into those pig pens. Once in a pig pen, the piggies do some predictable things. Since I am from farm country, I know piggies will move toward chow. You get the idea.
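
If the pig pen talk sounds abstract, here is a minimal sketch of the sort of thing I mean, using scikit-learn’s k-means and my 50-piggy lower bound. It is an illustration of the concept, not anyone’s production system, and it assumes you already have a few numeric behavior features per anonymous user.

```python
# Toy cohort ("pig pen") builder: cluster anonymous users on behavior,
# keep only pens with at least 50 piggies, then assign a new user to a pen.
import numpy as np
from sklearn.cluster import KMeans

MIN_PEN_SIZE = 50  # the lower boundary mentioned above

rng = np.random.default_rng(42)
behavior = rng.random((1000, 4))  # e.g., tweets/day, links clicked, follows, retweets

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(behavior)
sizes = np.bincount(kmeans.labels_, minlength=8)
usable_pens = {pen for pen, size in enumerate(sizes) if size >= MIN_PEN_SIZE}

new_user = rng.random((1, 4))
pen = int(kmeans.predict(new_user)[0])
if pen in usable_pens:
    print(f"New anonymous user goes in pen {pen} ({sizes[pen]} piggies already there)")
else:
    print("Pen too small to trust; need more data")
```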

When I read “Twitter Search History Dwindling, Now at Four Days”, I said to myself, “Twitter can charge for more data.” Who knows if I am right, but if I worked at Twitter, I can think of some interesting outfits that might pay for deep Twitter history. Who would want “deep Twitter history”? Good question. I have written about some outfits, and I have done some interviews in Search Wizards Speak and on Beyond Search that shed some light on these folks.

What can a data or text miner do with four days’ data? Learn that he / she needs a heck of a lot more to do some not-so-fuzzy mathy stuff.

Stephen E Arnold, August 14, 2010

Freebie.

Cloud and Context: Fuzzy and Fuzzier

August 11, 2010

I got a kick out of “Gartner Says Relevancy of Search Results Can be Improved by Combining Cloud and Context-Aware Services.” Fire up your iPad and check out this write up, which has more big ideas and insights than Leonardo, Einstein, or Andy Rooney ever had. You will want to read the full text of the article. What I want to do is list the memes that dot the write up like chocolate chips in Toll House cookies. Here goes:

  • value
  • cloud-based services
  • context-based services
  • revenue facing external search installation
  • informational services
  • integration engineers
  • contextual information
  • value from understanding
  • Web search efforts
  • market dynamics
  • general inclination
  • search in the cloud
  • discoverable information
  • offloading
  • quantifiable improvements
  • social networking
  • user’s explicit statement of interests
  • rich profile data

Cool word choice, right? Concrete. Specific. Substantive. Now here’s the sentence that I was tempted to flag as a quote to note. I decided to include it in this blog post:

Optimizing search through the effective application of context is a particularly helpful and effective way to deliver valuable improvements in results sets under any circumstances.

Got that? Any circumstance. Well, except being understandable to an addled goose.

Stephen E Arnold, August 11, 2010

Freebie

Six Semantic Vendors Profiled

August 9, 2010

I saw in my newsreader this story: “Introducing Six Semantic Technology Vendors: Strengthening Insurance Business Initiatives with Semantic Technologies.” The write up is a table of contents or a flier for a report prepared by one of the azurini with a focus on what seems to be “life and non life insurance companies.”

For me the most interesting snippet in the advertisement was this sequence, which I have formatted to make it more readable.

Attivio offers a common access platform combining unstructured and structured content [Note: one of Attivio’s founders has left the building. No details.]

Cambridge Semantics wants to help companies quickly obtain practical results [Note: more of a business intelligence type solution.]

Lexalytics has a ‘laser-focus’ on sentiment analysis. [Note: lots of search and content processing in a Microsoft centric wrapper.]

Linguamatics finds the nuggets hidden in plain sight. [Note: the real deal with a core competency in pharmaceuticals which I suppose is similar to life and non life insurance companies.]

MetaCarta identifies location references in unstructured documents in real-time. [Note: a geo tagging centric system now chased by outfits like MarkLogic, Microsoft, and lots of others.]

SchemaLogic enables information to be found and shared more effectively using semantic technologies. [Note: I thought this outfit managed metatags across an enterprise. At one time, the company was focused on Microsoft technology. Today? I don’t know because when one of the founders cut out, my interest tapered off.]

The list and its accompanying prose are interesting to me for three reasons:

First, the descriptions of these firms as semantic do not map to my impression of the six firms’ technologies. I am okay with the inclusion of Cambridge Semantics and Linguamatics, but I am not in sync with the azurini who plopped the other four outfits in the list. I think I can dredge up an argument to include these four firms on a content processing list, but gung-ho semantic technology? Nope.

Second, the link pointed me to a reseller of market research. The hitch in the git along for me was that the landing page did not point to the report. When I ran a query for “semantic technology vendors” I saw this message: “Sorry, no reports matching your search were found. For personal search assistance, please send us a request at contact@aarkstore.com.”

Third, the source of the report did not jump off the page at me. In short, what the heck is this document? How much does it cost? How can anyone buy it if the vendor’s search system doesn’t work and the write up on the Moso-technology.com Web site is fragmented?

I can’t recommend buying or not buying the report. Too bad.

Stephen E Arnold, August 9, 2010

Minority Report and Reality: The Google and In-Q-Tel Play

August 9, 2010

Unlike the film “Minority Report”, predictive analytics are here and now. More surprising to me is that most people don’t realize that the methods are in the category of “been there, done that.”

I don’t want to provide too much detail about predictive methods applied to military and law enforcement. Let me remind you, gentle reader, that using numerical recipes to figure out what is likely to happen is an old, old discipline. Keep in mind that the links in this post may go dead at any time, particularly the link to the Chinese write up.

There are companies that have been grinding away in this field for a long time. I worked at an outfit that had a “pretzel factory”. We did not make snacks; we made predictions along with some other goodies.

In this blog I have mentioned over time companies that operate in this sector; for example, Kroll (recently acquired by Altegrity) and Fetch Technologies. Now that’s a household name in Sioux City and Seattle. I have even mentioned a project on which I worked, which you can ping at www.tosig.com. Other hints and clues are scattered about like Johnny Appleseed’s wacky apple trees. I don’t plan on pulling these threads together in a free blog post.


© RecordedFuture, 2010. Source: http://www.analysisintelligence.com/

I can direct your attention to the public announcement that RecordedFuture has received some financial Tiger Milk from In-Q-Tel, the investment arm of one of the US government entities. Good old Google via its ventures arm has added some cinnamon to the predictive analytics smoothie. You can get an acceptable run down in Wired’s “Exclusive: Google, CIA Invest in ‘Future’ of Web Monitoring.” I think you want to have your “real journalist” baloney detector on because In-Q-Tel invested in RecordedFuture in January 2010, a fact disclosed on the In-Q-Tel Web site many moons ago. RecordedFuture also has a Web site at www.recordedfuture.com, rich with marketing mumbo jumbo, a video, and some semi-useful examples of what the company does. I will leave the public Web site to readers with some time to burn. If you want to get an attention deficit disorder injection, here you go:

The Web contains a vast amount of unstructured information. Web users access specific content of interest with a variety of Websites supporting unstructured search. The unstructured search approaches clearly provide tremendous value but are unable to address a variety of classes of search. RecordedFuture is aggregating a variety of Web-based news and information sources and developing semantic context enabling more structured classes of search. In this presentation, we present initial methods for accessing and analyzing this structured content. The RJSONIO package is used to form queries and manage response data. Analytic approaches for the extracted content include normalization and regression approaches. R-based visualization approaches are complemented with data presentation capabilities of Spotfire.
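
The RJSONIO mentioned above is an R package for forming JSON queries and handling the responses. For readers who do not live in R, here is roughly the same workflow sketched in Python: pull JSON, normalize mentions per day, and fit a trivial trend line. The endpoint, fields, and response shape are assumptions for illustration, not RecordedFuture’s actual API.

```python
# Sketch of the "query an API, normalize, regress" workflow described above.
# The endpoint and response format are hypothetical; RecordedFuture's real
# API is not documented here.
import json
import urllib.request
import numpy as np

def fetch_mentions(entity: str) -> list[dict]:
    url = f"https://api.example.com/mentions?entity={entity}"  # hypothetical endpoint
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())["instances"]

def daily_trend(instances: list[dict]) -> float:
    """Count mentions per day, then fit a least-squares slope (change in mentions per day)."""
    counts = {}
    for item in instances:
        counts[item["date"]] = counts.get(item["date"], 0) + 1
    y = np.array([counts[d] for d in sorted(counts)], dtype=float)
    x = np.arange(len(y), dtype=float)
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# trend = daily_trend(fetch_mentions("some_company"))  # needs a live endpoint, so commented out
```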

Read more

Taxodiary: At Last a Taxonomy News Service

August 3, 2010

I have tried to write about taxonomies, ontologies, and controlled term lists. I will be the first to admit that my approach has been to comment on the faux pundits, the so-called experts, and the azurini (self-appointed experts in metatagging and indexing). The problem with the existing content flowing through the datasphere is that it is uninformed.

What makes commentary about tagging informed? Three attributes. First, I expect those who write about taxonomies to have built commercially successful systems to manage term lists, and to have those term lists in wide use and conforming to standards from ISO, ANSI, and similar outfits. Second, I expect those running the company to have broad experience in tagging for serious subjects, not the baloney that smacks of search engine optimization and snookering humans and algorithms with alleged cleverness. Third, I expect the systems used to build taxonomies, manage classification schemes, and maintain term lists to work; that is, a user can figure out how to get information out of a system relevant to his or her query.


Splash page for the Taxodiary news and information service.

How rare are these attributes?

Darned rare. When I worked on ABI/INFORM, Business Dateline, and the other database products, I relied on two people to guide my team and me. The first person was Betty Eddison, one of the leaders in indexing. May she rest in indexing heaven where SEO is confined to Hell. Betty was one of the founders of InMagic, a company on whose board I served for several years. Top notch. Care to argue? Get ready for a rumble, gentle reader.

The second person was Margie Hlava. Now Ms. Hlava, like Ms. Eddison, is one of the top guns in indexing. In fact, on my yardstick she holds the top spot in this discipline. Please keep in mind that her company Access Innovations and her partner Dr. Jay ven Eman are included in my reference to Ms. Hlava. How good is Ms. Hlava? Very good, saith the goose.

Read more

IBM OmniFind 9.1: Trouble for Some Search Partners?

August 2, 2010

IBM has embraced open source. Now, before you wade through the links for the new IBM OmniFind 9.1 search system, three points. First, let me own up to a previous error: I did not believe that IBM would do much to make open source search a key part of the firm’s software strategy. I was wrong. IBM did, or people like Mike McCandless did. Second, the decision to use Lucene and wrap IBM’s product strategy and pricing around it pretty much means that some of IBM’s favored enterprise search vendors are going to find themselves sitting home when IBM makes certain sales calls. Third, the IBM pricing strategy does not mean that enterprise search IBM-style is free. The idea is that IBM will be able to chase after Microsoft without the legacy of the $1.3 billion investment in Fast Search & Transfer, the legal and police muddle, and the mind-boggling task of converting Fast into the broader vista of SharePoint. (Do you think my reference to “vista” evokes the Windows 7 predecessor? Silly you.)

Here’s what we have based on my poking around.

You get to license connectors. These puppies will be saddled with IBM pricing elements. This means that it will be tough for a customer to compare what he/she paid with what another customer paid. Bad for competitors too, but that’s a secondary issue compared to generating revenue. Run a query for part number BFG04CML. The adapters work with the UIMA standard.
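
To make the connector point concrete, here is a toy sketch of what a connector does conceptually: pull records from a source system and normalize them into one common document shape the indexer can consume. This is my illustration of the pattern, not IBM’s connector framework and not the UIMA API; every name in it is made up.

```python
# Toy "connector" pattern: each adapter turns a source's records into a
# common document shape for the indexer. Not IBM's API; names are invented.
from dataclasses import dataclass

@dataclass
class CommonDoc:
    doc_id: str
    title: str
    body: str
    source: str

class NotesConnector:
    """Pretend adapter for a Lotus Notes-style mail store."""
    def fetch(self) -> list[CommonDoc]:
        raw = [{"unid": "A1", "subject": "Q3 plan", "text": "Budget draft attached."}]
        return [CommonDoc(r["unid"], r["subject"], r["text"], "notes") for r in raw]

class FileShareConnector:
    """Pretend adapter for a file share."""
    def fetch(self) -> list[CommonDoc]:
        raw = [("f-77", "readme.txt", "Install notes for part BFG04CML.")]
        return [CommonDoc(i, name, text, "fileshare") for i, name, text in raw]

def crawl(connectors) -> list[CommonDoc]:
    """Gather normalized documents from every configured connector."""
    return [doc for c in connectors for doc in c.fetch()]

for doc in crawl([NotesConnector(), FileShareConnector()]):
    print(doc.source, doc.doc_id, doc.title)
```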

You get to pay for the multi language option. Same pricing deal as connectors.

There is an email search component, available as IBM OmniFind Personal E-Mail Search, or IOPES. This works with Lotus Notes and Microsoft Outlook. IBM sales engineers may be able to bundle up the bits and pieces needed to stop the not-so-well-known Isys Search Software outfit from Australia from selling search to a Lotus Notes customer.

The security model reminds me of Oracle’s SES11g approach. You get a system and then get to buy components. Same pricing model again.

You can license a classification model. Same pricing mechanism.

If you already have an OmniFind search installation, you have to reindex after working through the update procedure. That sequence is too complex for a blog post, and if anyone wants a summary, I charge for it. The darned method was not particularly easy to locate on the IBM Web site. Sorry, I run a business.

You can still handle collections, but you have to set these up via the administrative interface or the configuration files.

If you have a bunch of IBM servers running OmniFind, you have to update each one in the search system. Have fun.

There is a Web crawler available, and I think our test showed that it called itself UFOcrawler.

For more information about OmniFind 9.1 click this link. Be patient. The new color is green, which evokes the cost of the add-ons and components. Nevertheless, this is bad news for some commercial search and content processing vendors accustomed to IBM’s throwing them bones. IBM is now eating those bones in my opinion. The sauce is open source. Tasty too.

Stephen E Arnold, July 30, 2010

Exclusive Interview: Eric Gries, Lucid Imagination

July 27, 2010

It’s not every day that you find a revolutionary company like Lucid Imagination blazing a new trail in the Open Source world, a firm whose CEO describes it as being at 90 degrees to the traditional search business model.

Still, that’s the way that Eric Gries refers to Lucene/Solr’s impact on the search and content processing market. “The traditional search industry has not changed much in 30 years. Lucid Imagination’s approach is new, disruptive, and able to deliver high value solutions without the old baggage. We have flipped the old ideas of paying millions and maybe getting a solution that works. We provide the industrial strength software and then provide services that the client needs. The savings are substantial. Maybe we are now taking the right angle?” he asked with a big smile.

This pivot in the market reflects the destabilizing impact of open source search, and the business that Mr. Gries is building at supersonic speeds. “Traditional search is like taking a trip on a horse drawn cart. Lucid Imagination’s approach is quick, agile, and matched to today’s business needs.”

A seasoned executive in software and information management, Mr. Gries uses the phrase to capture his firm’s meteoric rise in the Open Source world and how the success of its Open Source model is giving traditional competitors such as Autonomy, Endeca, and Microsoft Fast indigestion.

Mr. Gries’s background speaks of the right pedigree for a professional who is at the helm of a successful startup.

He got his start at Cullinet Software. “I started in the computer sciences and joined my first company as part of the development team. Cullinet Software was an early leader. Databases were young, and relational databases, made famous by Larry Ellison, were just getting out of the gate,” he recently told Beyond Search. After he got his MBA, he moved more into the business side, among other things building the Network System Management Division at Compuware. He brings solid credentials in software services from that experience to the new venture at Lucid Imagination, a start-up with substantial venture backing.


Eric Gries, the mastermind of the Lucene Revolution. Source: Lucid Imagination, www.lucidimagination.com

He was first attracted to search and data and the relevant issues there. The lure of Open Source came later.

“The thing that attracted me to Open Source at first was the fact that search was really growing in leaps and bounds,” he said and he’s understandably proud of what the company has been able to accomplish so far.

“Lucene/Solr is software that is as good or better than most of the other commercial offerings in terms of scalability, relevance and performance.”
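
(For readers who have never poked at the engine being discussed, it is easy to try. Below is a minimal sketch of a query against a stock Apache Solr instance over its standard HTTP API. It assumes a local Solr with documents already indexed; the field names are placeholders, and this is the open source engine itself, not Lucid’s packaging.)

```python
# Minimal query against a stock Apache Solr instance's HTTP API.
# Assumes Solr is running locally with documents indexed; on multi-core
# setups the path is /solr/<core>/select. Field names are placeholders.
import json
import urllib.parse
import urllib.request

def solr_search(query: str, rows: int = 10) -> list[dict]:
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    url = f"http://localhost:8983/solr/select?{params}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())["response"]["docs"]

# for doc in solr_search("title:search"):
#     print(doc.get("id"), doc.get("title"))
```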

He talked recently about how it was important to him to put together the right kind of advisory guidance, drawing on people with real world experience in the technology and business of Open Source.

“I was new to the space, so very early on I put together a very strong advisory board of Open Source luminaries that were very helpful.”

Lucid Imagination, of which Mr. Gries is President and CEO, was launched in 2009 and is only in its second year of operation. Lucid closed millions of dollars’ worth of business in its first year. The recipe for success includes a deep level of involvement and collaboration with the communities outside Lucid, and ensuring the technology gets the right kind of attention on vital needs like quality and flexibility, the needs that drive organizations’ appetite for search technology.

The value of the business is about search, not open source. The company is riding the trends of search and Open Source, which Mr. Gries says is being accepted more and more as a mainstay of the enterprise.

The establishment has taken notice as well. Companies that understand the value of trailblazers, such as Red Hat, are “opening doors” for Lucene/Solr, according to Mr. Gries, and in turn helping Lucid establish itself as a second generation supplier of Open Source technology solutions.

Mr. Gries’s enthusiasm for his new type of business model is infectious and he enjoys pointing out the pride and dedication that goes into the work that gets done at Lucid, located in the heart of Silicon Valley.

“We added low cost to the metrics of scalability, relevance and performance so there’s really no good reason to use any commercial software with all due respect,” he added.

One of the more interesting aspects of Lucid is the fact that the firm has received $16 million in venture funding and is already building an impressive roster of clients that includes names like LinkedIn, Cisco, and Zappos, now a unit of the giant Amazon.

It’s clear that Mr. Gries understands that Open Source has been able to displace some commercial search solutions, and for him the reason is simple: the software is downloaded at the blistering pace of thousands of units a day.

“Now the software is good and industry sees that there is a commercial entity committed to working with them, we want the enterprise to see they can work with Open Source,” he noted.

Still, while there is what’s been described as considerable momentum among some developers for this technology, some senior information technology managers and some purchasing professionals are less familiar with Open Source software and Lucene/Solr.

That’s where Mr. Gries understands the need to get the word out on the firm. He has learned that educating the market is critical and hopes to build on the success that Lucid Imagination achieved by sponsoring a developer conference in Prague earlier this year. It was so successful that another, the Lucene Revolution, is planned for Boston in October.

Mr. Gries prides himself on the fact that the products created are all about a fresh business model with no distance between the developer and user. He’s proud of the success that the Lucene/Solr technology and community, along with his company, have enjoyed so far and likes to point out in his own way that one of the biggest goals beyond added value is increasing market exposure for what he calls this second generation Open Source.

“We are at 90 degrees to the typical search business model. We’re disruptive. We are making the competition explain a business model that is not matched to today’s financial realities. The handcuffs of traditional software licenses won’t fit companies that need agility and high value solutions,” he said. “The software is already out there and running mission critical solutions. One of our tasks is to make sure people understand what is available now, and the payoffs available right now.”

He points to the fact that success came so early for Lucene/Solr that the company has only just put its 24/7 customer service in place.

Open source means leveraging a community. Lucid combines the benefits of open source software with exceptional support and service. For more information about the company, its Web site is at www.lucidimagination.com.

Stephen E Arnold, July 27, 2010

I have been promised a free admission to the Lucene Revolution in October 2010.

Exclusive Interview: Mike Horowitz, Fetch Technologies

July 20, 2010

Savvy content processing vendors have found business opportunities where others did not. One example is Fetch Technologies, based in El Segundo, California. The company was founded by professors at the University of Southern California’s Information Sciences Institute. Since the firm’s doors opened in the late 1990s, Fetch has developed a solid clientele and a reputation for cracking some of the most challenging problems in information processing. You can read an in-depth explanation of the Fetch system in the Search Wizards Speak interview with Mike Horowitz.

The Fetch solution uses artificial intelligence and machine learning to intelligently navigate and extract specific data from user specified Web sites. Users create “Web agents” that accurately and precisely extract specific data from Web pages. Fetch agents are unique in that they can navigate through form fields on Web sites, allowing access to data in the Deep Web, which search engines generally miss.
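
Fetch’s agents are proprietary, but the general idea of a form-navigating extractor is easy to sketch. The toy agent below is my illustration, not Fetch’s technology; the URL and form fields are invented. It submits a search form, the kind of step a plain crawler skips, and then pulls rows out of the result page.

```python
# Toy "Web agent": submit a form, then extract structured rows from the
# result page. Not Fetch's proprietary system; URL and fields are invented.
import requests
from bs4 import BeautifulSoup

def run_agent(part_number: str) -> list[dict]:
    # Step 1: navigate a form field, the sort of thing generic crawlers miss.
    resp = requests.post(
        "https://catalog.example.com/search",        # hypothetical form target
        data={"q": part_number, "category": "all"},  # hypothetical form fields
        timeout=30,
    )
    resp.raise_for_status()

    # Step 2: extract precise data from the result page.
    soup = BeautifulSoup(resp.text, "html.parser")
    rows = []
    for tr in soup.select("table.results tr")[1:]:   # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:
            rows.append({"product": cells[0], "price": cells[1]})
    return rows

# print(run_agent("BFG04CML"))  # needs a real site, so left commented out
```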

You can learn more about the company and its capabilities in an exclusive interview with Mike Horowitz, Fetch’s chief product officer. Mr. Horowitz joined Fetch after a stint as a Googler.

In the lengthy discussion, Mr. Horowitz told me about the firm’s product lineup:

Fetch currently offers Fetch Live Access as an enterprise software solution or as a fully hosted SaaS option. All of our clients have one thing in common, and that is their awareness of data opportunities on the Web. The Internet is a growing source of business-critical information, with data embedded in millions of different Web sites – product information and prices, people data, news, blogs, events, and more – being published each minute. Fetch technology allows organizations to access this dynamic data source by connecting directly to Web sites and extracting the precise data they need, turning Web sites into data sources.

The company’s systems and methods make use of proprietary numerical recipes. Licensees, however, can program the Fetch system using the firm’s innovative drag-and-drop programming tools. One of the interesting insights Mr. Horowitz gave me is that Fetch’s technology can be configured and deployed quickly. This agility is one reason why the firm has such a strong following in the business and military intelligence markets.

He said:

Fetch allows users to access the data they need for reports, mashups, competitive insight, whatever. The exponential growth of the Internet has produced a near-limitless set of raw and constantly changing data, on almost any subject, but the lack of consistent markup and data access has limited its availability and effectiveness. The rise of data APIs and the success of Google Maps has shown that there is an insatiable appetite for the recombination and usage of this data, but we are only at the early stages of this trend.

The interview provides useful insights into Fetch and includes Mr. Horowitz’s views about the major trends in information retrieval for the last half of 2010 and early 2011.

Now, go Fetch.

Stephen E Arnold, July 20, 2010

Freebie. I wanted money, but Mr. Horowitz provided exclusive screen shots for my lectures at the Special Library Association lecture in June and then my briefings in Madrid for the Department of State. Sigh. No dough, but I learned a lot.

Google Metaweb Deal Points to Possible Engineering Issue

July 19, 2010

Years ago, I wrote a BearStearns white paper, “Google’s Semantic Web: the Radical Change Coming to Search and the Profound Implications to Yahoo & Microsoft,” May 16, 2007, about the work of Epinions’ founder, Dr. Ramanathan Guha. Dr. Guha bounced from big outfit to big outfit, landing at Google after a stint at IBM Almaden. My BearStearns report focused on an interesting series of patent applications filed in February 2007. The five patent applications were published on the same day. These are now popping out of the ever efficient USPTO as granted patents.

A close reading of the Guha February 2007 patent applications and other Google technical papers makes clear that Google had a keen interest in semantic methods. The company’s acquisition of Transformics at about the same time as Dr. Guha’s jump to the Google was another out-of-spectrum signal for most Google watchers.

With Dr. Guha’s Programmable Search Engine inventions and Dr. Alon Halevy’s dataspace methods, Google seemed poised to take over the floundering semantic Web movement. I recall seeing Google classification methods applied in a recipe demo, a headache demo, and a real estate demo. Some of these demos made use of entities; for example, “skin cancer” and “chicken soup”.


Has Google become a one trick pony? The buy-technology trick? Can the Google pony learn the diversify and grow new revenue tricks before it’s time for the glue factory?

In 2006, signals I saw flashed green, and it sure looked as if Google could speed down the Information Highway 101 in its semantic supercar.

Is Metaweb a Turning Point for Google Technology?

What happened?

We know from the cartwheels Web wizards are turning that Google purchased computer Zen master Danny Hillis’ Metaweb business. Metaweb, known mostly to the information retrieval and semantic Web crowd, produced a giant controlled term list of people, places, and things. The Freebase knowledgebase is a next generation open source term list. You can get some useful technical details from the 2007 “On Danny Hillis, eLearning, Freebase, Metaweb, Semantic Web and Web 3.0” and from the Wikipedia Metaweb entry here.

What has been missing in the extensive commentary available to me in my Overflight service is some thinking about what went right or wrong with Google’s investments and research in closely adjacent technologies. Please keep in mind that the addled goose is offering his observations based on his research for his three Google monographs, The Google Legacy, Google Version 2.0, and Google: the Digital Gutenberg. If you want to honk back, use the comments section of this Web log.

First, Google should be in a position to tap its existing metadata and classification systems, such as the Guha context server and the Halevy dataspace method, for entities. Failing these methods, Google has its user input methods like Knol and its hugely informative search query usage logs to generate a list of entities. Heck, there is even the disambiguation system to make sense of misspellings of people like Britney Spears. I heard a Googler give a talk in which he shared the factoid that hundreds of variants of Ms. Spears’s name were “known” to the Google system and properly substituted automagically when the user goofed. The fact that Google bought Metaweb makes clear that something is still missing.
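
The misspelling substitution trick is not magic, by the way. A crude version can be built with nothing more than the Python standard library. The sketch below is a toy, not Google’s method; it simply maps a goofed query term to the closest known entity name.

```python
# Crude entity-name disambiguation: map a misspelled query term to the
# closest canonical name. A toy, not Google's system.
import difflib

CANONICAL_ENTITIES = ["britney spears", "bruce springsteen", "brian eno"]

def disambiguate(term: str):
    """Return the closest canonical entity name, or None if nothing is close."""
    matches = difflib.get_close_matches(term.lower(), CANONICAL_ENTITIES, n=1, cutoff=0.7)
    return matches[0] if matches else None

print(disambiguate("brittany spears"))   # likely -> britney spears
print(disambiguate("xyzzy"))             # -> None
```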

Read more
