Yahoo Cost Estimate
July 11, 2008
I wanted to run through some of the cost data I have gathered over the years. The reason is this sentence in Miguel Helft’s “Yahoo Is Inviting Partners to Build on Its Search Power,” an essay that appeared in the Kentucky edition of the New York Times, July 10, 2008, page C5:
Yahoo estimates that it would cost $300 million to build a search service from scratch.
No link for this. Sorry. I have the dead tree version, and I refuse to deal with the New York Times’s Web site, and its weird reader thing.
The Yahoo BOSS initiative has been choking my news reader. I don’t want to be a link pig, but I will flag three posts that you may want to scan. First, the LA Times’s “Who’s the BOSS? Yahoo Searches for a Way to Unseat Google,” by Jessica Guynn. You can, as of 7:45 pm on July 10, 2008, read it here. I liked this write up because of this remark:
Yahoo has made myriad efforts over the years.
By golly, that nails it. Lots of effort, little progress. The rest of Ms Guynn’s essay unrolls a well worn red carpet decorated with platitudes.
Next, I suggest you scan Larry Dignan’s essay “Yahoo’s Desperate Search Times Call for Open Source.” I like most of the ZDNet essays. I would characterize the approach as gentle pragmatism. I liked this sentence:
Yahoo’s open strategy makes a lot of sense. But let’s not kid ourselves, Yahoo’s open strategy could be characterized as a Hail Mary pass too. It may work. BOSS may turn out to be brilliant. But let’s reserve judgment until we see some results–on the business and technology fronts.
Nailed. Enough said.
The last essay on this short list is John Letzing’s “In an Effort to Disrupt, Yahoo Further Opens Search” on MarketWatch. You can read this article here. (Warning: MarketWatch essays can be tough to track down. Very wacky url and a not-so-hot search engine make a killer combination.) The essay is good, and it takes a business angle on the story. For me, this was the key sentence:
Yahoo distributed a slide presentation to accompany news of the BOSS initiative that includes a pie chart showing a dramatic projected gain for “BOSS partners & developers,” at the expense of Google, Microsoft and Yahoo-branded services. Michels stressed that the pie chart isn’t based on actual calculated estimates, but rather reflects Yahoo’s directional goals.
Presentations based on assumptions–those will go a long way to restoring investor confidence in Yahoo.
Now back to the single sentence in the New York Times today:
Yahoo estimates that it would cost $300 million to build a search service from scratch.
This is Yahoo math.
My data suggest that Yahoo’s estimate is baloney. Over the years, Yahoo has accumulated search technologies; for example, Inktomi, AllTheWeb.com, Stata Labs, and AltaVista.com. Yahoo’s acquisitions arrived with search systems, often pretty weak; for example, Delicious.com’s and Flickr.com’s. Yahoo has licensed third-party search tools such as InQuira’s question answering system. To top it off, Yahoo’s engineers have cooked up Mindset, which has some nice features, and the more recent semantic search system here.
This $300 million number is low enough for a company of Yahoo’s size to have built a search system if it could be done. The wacky estimates and the track record of collecting search systems like the hopefuls on Antiques Roadshow are evidence that Yahoo could not build a search system.
Yahoo could spend time, money, and talent creating a collection of stuff that has zero chance of thwarting Google. The search vendors lining up to use Yahoo’s index and infrastructure, the open source voodoo, and the unsubstantiated cost estimate underscore how far from reality Yahoo has allowed itself to drift.
I am going to watch how the BOSS play unfolds. Yahoo is in a pretty unpleasant spot, and its executives’ willingness to do first year MBA student projections annoys me.
Let me end with a question. If search is a $300 million investment, for what is Google spending billions? Why is Microsoft spending more billions than Google AND buying search technology with a devil-may-care insouciance that I admire? It is as if Carly Fiorina were the buy out guru.
Yahoo’s ad revenue projections and its cost estimates are examples of spreadsheet fever. I hope the disease runs its course before the patient becomes incurable.
There’s a math cartoon floating around. The letter “i” (Descartes’ imaginary number) is talking to pi (the Greek symbol you recall from 7th grade math). The caption is, “Get real.” Good advice. Those writing about Yahoo may want to pepper their questions with “Get real.”
Stephen Arnold, July 11, 2008
Artificial Intelligence: Once Again Safe to Use the Term
July 11, 2008
For years, I have been reluctant to use the phrase “artificial intelligence”. I danced around the subject with “computational intelligence,” “smart software,” and “machine intelligence.” Google aced me with its use of the term “janitors” to refer to a smart factory that generated smart digital robots to clean up messes left from other data processes.
Now, the highly-regarded Silicon.com has made it okay for me to say, “artificial intelligence” and use the acronym “AI” without fear of backlash. Tim Ferguson has an essay “Artificial Intelligence–Alive and Kicking” built in part upon an interview with Professor Nigel Shadbolt, head of artificial intelligence at Southampton University.
The point of the essay is that AI continues to thrive without the wackiness that accompanied the hype from a decade ago. Examples of practical AI are voice recognition (my phone often doesn’t understand me when I am driving) and vision processing software (works okay for certain types of object recognition but not others).
The key point in the essay was this statement attributed to Professor Shadbolt:
What we’re seeing with the web is the way in which it can bring those things that computers are good at in co-ordination with what people are good at. You use people’s innate intelligence and ability, you connect them up on planetary scale and you’ve got a new kind of assisted intelligence. It isn’t an AI because it’s not in any way self aware – but it’s a phenomenal, powerful thing.
Bingo. What people are calling social search, collaboration, and intelligent systems is a mash up. I quite like the phrase “assisted intelligence.” Software can be more intelligent when the inputs and outputs of humans are factored into the probabilities used by algorithms to “decide”.
I will promptly co-opt the phrase “assisted intelligence”. I will give Messrs Ferguson and Shadbolt credit in this essay. I know that subsequent uses will be less disciplined about giving credit where credit is due. “Assisted intelligence” is a useful coinage.
I would like to offer three observations, which is my prerogative in my own personal Web log:
- Artificial or assisted intelligence is going to require a heck of a lot of resources, particularly if the volume of digital information continues to go up. How many companies will have the appetite to craft a large-scale system? Certainly police and intelligence authorities, companies like Google and Microsoft, and giant multi-nationals like big pharma.
- The spectrum of AI applications will range from the mundane (my thermostat adjusting itself to keep my environment a constant 72 degrees Fahrenheit) to the exotic (the aforementioned janitors doing the work of human subject matter experts inside Google’s data centers). At the same time we become indifferent to AI, some applications will make headlines. There will be some debate of artificial and assisted intelligence going forward.
- AI (both assisted and artificial) will disintermediate some people along the way. Life will be good for wizards and rocket scientists. Life will not be so good for those displaced; for example, why would a start up publisher want to use the job descriptions for a traditional printed newspaper publisher? Better to trim the staff, focus on software, and keep the costs low and the margins as high as possible.
AI is back. I don’t think it ever left. The media veered into more trendy subjects. Let the applications flow.
Stephen Arnold, July 11, 2008
Hakia to Accelerate Semantic Analysis of the Web
July 10, 2008
A somewhat bold headline hopped from my news reader screen this morning (July 10, 2008). A news release from Hakia, one of the players in the semantic search football match, told me: “Hakia Leverages Yahoo Search BOSS to Accelerate Its Semantic Analysis of the World Wide Web.” You can get a copy of this release from Farrah Hamid (farrah at hakia dot com). As of 8:50 am, the news release is not on the Hakia Web log nor is there a link to this Hakia announcement.
The key point in the news release is that Hakia is using Yahoo’s Build Your Own Search Service or BOSS. The idea is that Hakia will use Yahoo’s search infrastructure to “accelerate Hakia’s crawling of the Web to identify quality documents for semantic analysis using its advanced QDEX (Query Detection and Extraction) technology.” The “its” refers to Hakia’s patented technology, not Yahoo’s BOSS service.
Using Yahoo makes sense for two reasons. First, scaling to index Web content is expensive, a fact lost on many search mavens who don’t have a sense of the economics of content processing. Second, Yahoo’s BOSS makes it reasonably easy to tap into Yahoo’s plumbing. I wondered why other semantic search vendors have not looked at this type of hook up to better demonstrate the power of their systems. A couple of years ago, Siderean Software processed the Delicious.com content, and I found that a particularly good demo of the Siderean technology as well as a very useful resource. I have lost track of Siderean’s Delicious index, so I will need to do a bit of sleuthing later today.
Also, you can refresh your recollection of BOSS at http://www.developer.yahoo.com/boss. While you are at the Yahoo site, check out Yahoo’s own semantic search system, which left me a trifle disappointed. This system is shod with this url http://www.yr-bcn.es/demos/microsearch/. My write up about yr-bcn is here. One hopes the Hakia system raises the bar for Yahoo-based semantic efforts. It would be useful if Hakia puts up a head-to-head comparison of its system compared to Yahoo’s. You can see the Hakia comparison with Google here.
The choice of the BOSS service is understandable. Yahoo these days seems pliable. Cutting a deal with Google is fuzzy, often depending on which Googler one tracks down via email or at a conference. In my opinion, Google has been playing hardball in the semantic space. I am starting to think Google has designs on jump starting the semantic search “revolution” and putting its own systems and methods in place. The semantic Web certainly has not taken off, so why not entertain the notion of Google as the Semantic Web? Makes sense to me.
Microsoft, fresh from its hunt for semantic technology, is a big outfit, so it is also difficult to find an “owner” of the task a company like Hakia wants to use. Microsoft can put a price tag on accessing its index, which one cheery Redmonian told me now contained 25 billion Web pages. I told the Redmonian, “My tests suggest that the index is in the 5 to 7 billion page range.” I was told that I was an addled goose. So, what’s new.
Yahoo–troubled outfit that it is–probably welcomes an opportunity to allow Hakia to get the portal some positive media coverage. But if I had been advising Hakia (which I am not), I would have suggested Hakia give Exalead in Paris, France, a jingle. Exalead’s Web index is fresh, contains eight billion or so Web pages, and its engineers are quite open to new ideas. Yandex also might have made my list of partners.
Check out the Hakia system at http://www.hakia.com. When I get additional information, I will try to update this post.
Stephen Arnold, July 10, 2008
Update: July 10, 2008, 10 am: My Hakia post is part of a larger fabric of Yahoo BOSS coverage. You will want to read “Yahoo Radically Opens Web Search with BOSS” in the July 9, 2008, TechCrunch. Mark Hendrickson’s coverage is a very good summary of the information on Yahoo’s Web site. He also takes a positive stance, noting “BOSS is the second concrete product to come out of Yahoo’s Open Strategy. The first was Search Monkey back in April [2008].” I am not ready to even think about being positive. These types of announcements are coming when the firm is in disarray. Any announcement, therefore, may be moving deck chairs on the Titanic. I will take a more skeptical position and say, “Let’s see how this plays out.” Yahoo is in flux, and its own semantic search system, referenced in the essay above, is not too good.
Update 2, July 10, 2008, 10:10 am Eastern time: Hakia provided this information to me just a few moments ago.
- The news release is on the Hakia Web site at http://company.hakia.com/pr-070308.html. Don’t forget the dots. (How about an explicit link on the splash page, Hakia?)
- You can find other Hakia news releases at this location http://company.hakia.com/press.
- The “official” Yahoo release is here: This url is too crazy to reproduce.
Ballmer Wants Cool Stuff from Microsoft
July 10, 2008
The Houston Chronicle’s Brad Hem wrote an essay that caught my attention just as I was heading to the log cabin in Harrod’s Creek to catch 40 winks. The title did it: “CEO: Microsoft Needs to Do More Cool Stuff.” The full text is here. The Houston Chronicle, fine paper that it is, struggles with search and its content management system. If you get a 404, good luck finding Mr. Hem’s must read story.
For me, the most interesting point was:
He [Mr. Ballmer] disputed the idea that Apple or Google is cooler than Microsoft.
Upon reading this sentence, the following thoughts flapped through my mind:
- The local mall is struggling to keep those wanting to buy iPhones out of the parking lot. I don’t recall any problem with squatters at the mall when Vista was released. I guess the 3G iPhone is uncool.
- Google snags 80 percent of the Web search traffic in Germany. Nah, this type of market penetration is definitely not cool.
- I received two emails with photos of a next-generation Macbook with an aluminum case. Again, who really cares? Not cool.
- Google rolls out a Second Life clone, its own version of XML to address transformation hassles, and posts on its corporate Web log a user’s method for making Google Docs work like a Web log editor. Again, opposite of cool. (I actually thought this was pretty nifty, but nifty is not cool.)
I look forward to some cool stuff from Microsoft; for example, more money paid to me for using Microsoft Web sites, an Xbox discount, a new version of SQL Server that really does complete back ups, and inclusion of DNABlueprint in the MSDN Web site.
Stephen Arnold, July 10, 2008
SQL Server: Bringing the Plow Horse to the Race Track for the Derby
July 10, 2008
SQL Server has bought a lot of dog food in Harrod’s Creek. We got paid to figure out why SQL Server back up and replication crashed and burned. We got paid to make SQL Server go faster. We got paid to grunt through scripts to figure out why reports were off by one. Yep, we like that plow horse. It works like a champ for most business database needs. You can use Access as a front end. You can make some nice looking forms with Microsoft tools with some fiddling.
This is a Microsoft diagram. The release date is August, maybe September 2008. More information is here.
But, when the old plow horse has to amble through petabytes of data, SQL Server is not the right animal for the job. In order to search gigabytes of normalized tables, you need to find a way to short cut the process. One of my colleagues figured out a way to intercept writes, eject them, and build a shadow index that could be searched using some nifty methods. Left to its own devices, SQL Server would stroll through processes, not gallop.
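The shadow index idea can be roughed out in a few lines. This is only a minimal sketch of the general technique, not my colleague's actual method: it assumes Python's built-in sqlite3 as a stand-in for SQL Server, a hypothetical table named notes, and a hypothetical wrapper class called ShadowIndexedTable. Every write goes through the wrapper, which stores the row and also records each word in an in-memory inverted index, so a search intersects small sets of row ids instead of strolling through the table.

```python
import re
import sqlite3
from collections import defaultdict


class ShadowIndexedTable:
    """Hypothetical wrapper: mirrors every write into an in-memory inverted index."""

    def __init__(self):
        # sqlite3 is an assumption standing in for the real database server.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
        # token -> set of row ids; this is the "shadow" copy used for lookups.
        self.index = defaultdict(set)

    @staticmethod
    def _tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def insert(self, body):
        # The write is intercepted here: it goes to the table and to the index.
        cur = self.db.execute("INSERT INTO notes (body) VALUES (?)", (body,))
        row_id = cur.lastrowid
        for token in self._tokens(body):
            self.index[token].add(row_id)
        return row_id

    def search(self, query):
        # Intersect the posting sets rather than scanning the table with LIKE.
        tokens = self._tokens(query)
        if not tokens:
            return []
        ids = set.intersection(*(self.index.get(t, set()) for t in tokens))
        if not ids:
            return []
        placeholders = ",".join("?" * len(ids))
        rows = self.db.execute(
            f"SELECT id, body FROM notes WHERE id IN ({placeholders}) ORDER BY id",
            sorted(ids),
        )
        return rows.fetchall()


table = ShadowIndexedTable()
table.insert("Replication crashed during the nightly back up")
table.insert("Report totals were off by one")
table.insert("Back up completed without errors")
print(table.search("back up"))
```

The sketch handles inserts only. A production version would also have to intercept updates and deletes and keep the index consistent after a crash, which is where the real engineering effort goes.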
I spoke with a skeptic today. Her comments caused me to think about SQL Server in a critical way. Are these points valid? Let’s follow the plow horse idea and see if there’s hay in the stall.
Selected Features
Like she said to me, “A different data management animal is needed, right?”
Will SQL Server 2008 be that beast? Here’s what she told me about the most recent version of this data work horse:
- An easier to use report builder. I thought the existing report tools were pretty spiffy. Guess I was wrong.
- Table compression. A good thing but the search still takes some time. Codd databases have their place, but the doctor did not plan for petabyte-scale tables, chubby XML tables, and the other goodies that modern day 20-somethings expect databases to do.
- More security controls. Microsoft engineers are likely to spark some interest from Oracle, a company known for making security a key part of its database systems.
- Streamlined administrative controls. Good for a person on a salary. Probably a mixed blessing for SQL Server consultants.
- Plumbing enhancements. We like partitioned table parallelism because it’s another option for whipping the plow horse.
These are significant changes, but the plow horse is still there, she asserted. She said, “You can comb the mane and tail. You can put liquid shoe polish on the hooves. You can even use a commercial hair conditioner to give the coat a just groomed look. But it is still a plow horse, designed to handle certain tasks quite well.”
Microsoft’s official information page is here. You can find useful links on MSDN. I had somewhat better luck using Google’s special purpose Microsoft index. Pick your poison.
Observations
If you are a Microsoft Certified Professional, you probably wonder why I am quoting her plow horse analogy. I think SQL Server 2008 is a vastly improved relational database. It handles mission critical applications in organizations of all sizes 24×7 with excellent reliability when properly set up and resourced. Stop with the plow horse.
Let’s shift to a different beast. No more horse analogies. I have a sneaking suspicion that the animal to challenge is Googzilla. The Web search and advertising company uses MySQL for routine RDBMS operations. But for the heavy lifting, Googzilla has jumped up a level. Technically, Google has performed a meta maneuver; that is, Google has looked at the problems of data scale, data transformation (a function that can consume as much as 30 percent of an IT department’s budget), and the need to find a way to do input output and read write without slowing operations to a tortoise-like pace.
So, Microsoft is doing database; Google is doing data management of which database operations are a sub set and handled by MySQL and the odd Oracle installation.
What’s the difference?
In my experience, when you have to deal with large amounts of data, Dr. Codd’s invention is the wrong tool for the job. The idea of big static databases that have to be updated in real time is an expensive proposition, not to mention difficult. Sure, there are work arounds with exotic hardware and brittle engineering techniques. But when you are shoving petas, you don’t have the luxury of time. You certainly don’t have the money to buy cutting edge gizmos that require a permanent MIT engineer to baby sit the system. You want to rip through data as rapidly as possible yet have an “as needed” method to querying, slicing, dicing, and transforming.
That’s her concern, and I guess it is mine too, with regard to SQL Server 2008. The plow horse is going to be put in the Kentucky Derby, and it will probably finish the race, just too slow to win or keep the fans in their seats. The winners want to cash in their tickets and do other interesting things.
When it comes to next generation data manipulation systems, Googzilla may be the creature to set the pace for three reasons:
- Lower cost scaling
- Optimized for petabyte and larger data
- Distributed, massively parallel operation.
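To make the third point concrete, here is a toy sketch of the partition, scan, and merge pattern. It is an illustration under loud assumptions, not a description of Google's systems: the function names count_terms and parallel_term_counts are hypothetical, the sample data are made up, and threads on one machine stand in for what would be many separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_terms(partition):
    """Map step: scan one partition and return its local term counts."""
    counts = Counter()
    for record in partition:
        counts.update(record.lower().split())
    return counts


def parallel_term_counts(partitions, workers=4):
    """Scan partitions concurrently, then merge the partial results (reduce step)."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_terms, partitions):
            total.update(partial)
    return total


# Made-up sample data: each inner list plays the role of one data partition.
sample_partitions = [
    ["plow horse", "race horse"],
    ["plow horse again"],
]
print(parallel_term_counts(sample_partitions))
```

The point of the pattern is that each partition is scanned independently, so adding partitions and workers spreads the load instead of asking one big database to walk everything.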
Agree? Disagree? Let me know. Just have some cost data so I can get back to my informant.
Stephen Arnold, July 10, 2008
Autonomy Discovers Virtualization (Not My Headline)
July 10, 2008
Internet News’s February 6, 2008, essay “Autonomy Discovers Virtualization” turned up in my news reader this morning. You can read the full but old story here.
The point of the article is that Autonomy acquired Zantaz. Zantaz has software called Intraspect. The Intraspect software is, according to Internet News, “the first to offer automated search or discovery in a wide range of virtual environments, including VMWare, a process that usually requires a time-consuming, manual set of steps, if it’s done at all.”
And who am I to doubt Internet News?
What caught my eye was the reference to VMWare. That company is in the news. ZDNews has a useful overview of the company’s problems here. My hunch is that filters are on the look out for VMWare as the company spirals into more rough winds. Autonomy may get some play, but in the context of VMWare, I am not sure the halo effect is working the way it should.
Oh, the Internet News headline reminded one of my engineers of former Vice President Al Gore’s statement about “inventing the Internet”. The word “discovers” in the Internet News story appears to have a similar effect on my technical team.
Stephen Arnold, July 10, 2008
A Bustling Brainware
July 10, 2008
A lousy economy is not affecting Brainware, judging from their recent actions. First, the company said that it was adding staff. Not one or two people, but a Deputy Vice President of Operations, two Project Managers, a Senior Account Executive, a Marketing Programs Director, a Customer Support Supervisor, a Recruiter, twelve senior development, product implementation and support resources, as well as various administrative and office staff. You can read the full scoop here. To put this in perspective, these additions number more employees than many search vendors have in their entire company. Most search and content processing companies are surprisingly small.
Another move is the hiring of Blaine Owens as a regional vice president for the company. You can read that item in full here. Mr Owens is a former EMC Captiva Software vice president. Brainware, in addition to its patented search method, has a document acquisition and work flow component.
For more information about the company, navigate to www.brainware.com or read this interview that appeared in ArnoldIT.com’s Search Wizards Speak series. One final point: if the news release links 404, you will be able to get most of the information from the Brainware Web site. PR stuff is tough to find after a day or two “in the wild”.
Stephen Arnold, July 10, 2008
WAND: New Business Taxonomy Available
July 10, 2008
Taxonomies are slightly less popular among the enterprise search crowd than Hannah Montana and petrol prices. WAND, a developer of controlled vocabulary tools and services, has rolled out what the company calls “a robust enterprise taxonomy.”
The idea is that most organizations remain clueless about taxonomies, controlled vocabularies, knowledge bases, and ontologies. The words are easy to say, but the ability to create a schema that a human being in an organization can use is a very different kettle of fish.
WAND’s taxonomy will allow a clueless or semi-clueless organization to get a taxonomy, edit it, and use the terms and hierarchies as a way to tag processed content. According to the company’s news release:
WAND’s new business vocabulary provides a four-level hierarchy of important business terminology covering human resources, accounting and finance, sales and marketing, legal, and information technology. The vocabulary includes all the core business concepts that any company has to deal with and can be extended and customized to include company specific terminology. WAND’s enterprise taxonomy can easily be paired with an existing enterprise search engine to improve the relevancy of search results returned.
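For readers who have not seen a taxonomy put to work, here is a minimal sketch of tagging content with a four-level hierarchy. The terms and paths below are hypothetical examples I made up, not WAND's vocabulary, and the function name tag_document is an assumption. Real systems use far better matching than the crude substring test shown here.

```python
# Hypothetical four-level paths: domain > area > topic > term.
TAXONOMY = [
    ("Business", "Human Resources", "Compensation", "payroll"),
    ("Business", "Human Resources", "Recruiting", "job description"),
    ("Business", "Accounting and Finance", "Reporting", "balance sheet"),
    ("Business", "Legal", "Contracts", "non-disclosure agreement"),
]


def tag_document(text, taxonomy=TAXONOMY):
    """Return the full hierarchy path for every leaf term found in the text."""
    lowered = text.lower()
    return [" > ".join(path) for path in taxonomy if path[-1] in lowered]


# Extending the list with a company-specific term (also hypothetical).
custom = TAXONOMY + [("Business", "Sales and Marketing", "Pricing", "goose discount")]

print(tag_document("Please attach the job description and payroll schedule."))
print(tag_document("Ask sales about the goose discount.", custom))
```

Because each tag carries its full path, a search engine can use the upper levels as facets while the leaf term does the matching.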
You can learn more about the company and license fees here. I wrote about Arikus, another vendor offering off-the-shelf taxonomies, here. I profile two other taxonomy players, Access Innovations and SchemaLogic, in my Beyond Search study for the Gilbane Group. You can also tap MuseGlobal for this type of information. Some companies assert that you can learn how to “do” a taxonomy quickly by signing up for a one-day class. Okay, maybe that will work. It’s taken most of the professionals working on real-deal controlled vocabularies decades to hone their skills. I thought I knew words, but after working with Betty Eddison, founder of InMagic, and later with the Access Innovations’ team, I learned that I knew essentially zero. Fortunately, working with these folks helped me to be more informed about knowledge systems.
Take a peek at the WAND controlled term list and share what you learn with the two or three readers of this Web log.
Stephen Arnold, July 10, 2008
Copernic Desktop Search Updated
July 9, 2008
Copernic, the Canadian developer of search systems, has released a new version of Copernic Desktop Search. You can download a trial version here. Version 2.3 features speed improvements, a “did you mean” function to correct common misspellings, and a federation feature. A user can now search all index categories with an “All” feature. I particularly liked the “save search” feature. I often run the same query in the course of a project. For me, this is an important time saver. In my opinion, you will want to download the new version and drive it around your data race track.
Stephen Arnold, July 9, 2008
Autonomy Wins Spanish Health Service Deal
July 9, 2008
On July 8, 2008, my trusty news reader displayed “Spanish Health Service Provider Selects Autonomy to Deliver Innovation within Healthcare.” You can read more about the deal here. Hard on the heels of a slam dunk in Lyon, France, Autonomy appears to be mounting a sales charge in Western Europe.
The most interesting part of this deal is that it is in health care, one of the niches that search and content processing vendors are pursuing. Autonomy’s angle is health care modernization, which I think blends content processing and operational efficiencies. One part of the new system will “provide alerts on patient prescription conflicts as part of a national project.”
Some Spanish government agencies have been flirting with open source. Autonomy leveraged its IDOL (intelligent data operating layer) into a key position in this region, which is smaller than other Spanish governmental units.
Kudos to the Autonomy sales team. If I am able to determine if this was a clean win or an upsell from a previous Verity installation, I will include that detail in an update to this news item. I am seeing a flood of Autonomy related news and interviews with Autonomy professionals. Lots of activity from this giant in search, content processing, and related systems. In fact, from where I sit, Autonomy’s ramp up parallels Google’s “transparency” PR push. One oddity I noted is that Microsoft Fast Search has gone quiet in the search marketing visibility arena.
Stephen Arnold, July 9, 2008