Duplicates and Deduplication
December 29, 2008
In 1962, I was in Dr. Daphne Swartz’s Biology 103 class. I still don’t recall how I ended up amidst the future doctors and pharmacists, but there I was sitting next to my nemesis Camille Berg. She and I competed to get the top grades in every class we shared. I recall that Miss Berg knew that there were five variations of twinning: three dizygotic and two monozygotic. I had just turned 17 and knew about the Doublemint Twins. I had some catching up to do.
Duplicates continue to appear in data just as the five types of twins did in Bio 103. I find it amusing to hear and read about software that performs deduplication; that is, the machine process of determining which item is identical to another. The simplest type of deduplication is to take a list of numbers and eliminate any that are identical. You probably encountered this type of task in your first programming class. Life gets a bit more tricky when the values are expressed in different ways; for example, a mixed list with binary, hexadecimal, and real numbers plus a few more interesting versions tossed in for good measure. Deduplication becomes a bit more complicated.
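The mixed-notation case can be sketched in a few lines of Python. This is a minimal illustration, not a production routine. It assumes binary values carry a `0b` prefix and hexadecimal values a `0x` prefix, and the function names `canonical` and `dedupe` are invented for this example. Each string is converted to one exact value before comparison, so different spellings of the same number collapse to a single entry.

```python
from fractions import Fraction


def canonical(token: str) -> Fraction:
    """Convert one textual number to an exact value (assumed notations only)."""
    text = token.strip().lower()
    sign = -1 if text.startswith("-") else 1
    text = text.lstrip("+-")
    if text.startswith("0b"):          # binary, e.g. 0b1010
        return sign * Fraction(int(text[2:], 2))
    if text.startswith("0x"):          # hexadecimal, e.g. 0xA
        return sign * Fraction(int(text[2:], 16))
    # Decimal, real, exponent, or ratio notation: "10", "10.0", "1e1", "20/2"
    return sign * Fraction(text)


def dedupe(tokens):
    """Keep the first spelling of each distinct value, in input order."""
    seen = set()
    unique = []
    for token in tokens:
        value = canonical(token)
        if value not in seen:
            seen.add(value)
            unique.append(token)
    return unique


mixed = ["10", "0b1010", "0xA", "10.0", "1e1", "7", "0x7", "3.5"]
print(dedupe(mixed))
```

The design choice is to compare exact values rather than strings. Anything the parser does not recognize raises an error instead of being silently treated as unique, which is the safer failure for a cleanup job.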
At the other end of the scale, consider the challenge of examining two collections of electronic mail seized from a person of interest’s computers. There is the email from her laptop. And there is the email that resides on her desktop computer. Your job is to determine which emails are identical, prepare a single deduplicated list of those emails, generate a file of emails and attachments, and place the merged and deduplicated list on a system that will be used for eDiscovery.
Here are some of the challenges that you will face once you answer this question, “What’s a duplicate?” You have two allegedly identical emails and their attachments. One email is dated January 2, 2008; the other is dated January 3, 2008. You examine each email and find that the only difference between the two is the inclusion of a single slide in one of the two PowerPoint decks. What do you conclude?
- The two emails are not identical, so include both emails and the two attachments.
- The earlier email is the accurate one, so exclude the later email.
- The later email is the accurate one, so exclude the earlier email.
Now consider that you have 10 million emails to process. We have to go back to our definition of a duplicate and apply the rules for that duplicate to the collection of emails. If we get this wrong, there could be legal consequences. A system developer who generates a file of emails in which a mathematical process alone has determined that one record differs from another may be using a method too crude for eDiscovery. Math helps, but it is not likely to handle the onerous task of determining near matches and the reasoning required to determine which email is “the” email.
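Here is a rough sketch of why a purely mathematical test is crude. It assumes each email is a Python dictionary with a `body` string and a list of `attachments` as bytes; the field names, the `classify` function, and the 0.9 threshold are all hypothetical choices for illustration. Dates are deliberately left out of the fingerprint. An exact hash can only say “same” or “different”, so the sketch adds a similarity check that routes near matches to a person rather than deciding for them.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(email: dict) -> str:
    """Exact-match fingerprint: normalized body text plus attachment bytes."""
    h = hashlib.sha256()
    normalized = " ".join(email["body"].split()).lower()
    h.update(normalized.encode("utf-8"))
    for attachment in email["attachments"]:
        h.update(hashlib.sha256(attachment).digest())
    return h.hexdigest()


def classify(a: dict, b: dict, threshold: float = 0.9) -> str:
    """Label a pair as duplicate, near-duplicate, or distinct."""
    if fingerprint(a) == fingerprint(b):
        return "duplicate"
    # Hashes differ. Compare body text to see whether the pair is a near match.
    ratio = SequenceMatcher(None, a["body"], b["body"]).ratio()
    if ratio >= threshold:
        return "near-duplicate: needs review"
    return "distinct"


jan2 = {"body": "Deck attached.", "attachments": [b"slide1 slide2"]}
jan3 = {"body": "Deck attached.", "attachments": [b"slide1 slide2 slide3"]}
print(classify(jan2, jan3))
```

The one-slide difference changes the hash, so the math says “different”. Which of the two emails is “the” email is still a question for a human reviewer and the rules agreed on for the matter.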
Which is Jill? Which is Jane? Parents keep both. Does data work like this? Source: http://celebritybabies.typepad.com/photos/uncategorized/2008/04/02/natalie_grant_twins.jpg
Here’s another situation. You are merging two files of credit card transactions. You have data from an IBM DB2 system and you have data from an Oracle system. The company wants to transform these data, deduplicate them, normalize them, and merge them to produce one master “clean” data table. No, you can’t Google for an offshore service bureau; you have to perform this task yourself. In my experience, the job is going to be tricky. Let me give you one example. You identify two records which agree in field name and data for a single row in Table A and Table B. But you notice that the telephone number varies by a single digit. Which is the correct telephone number? You do a quick spot check and find that half of the entries from Table B have this variant, or you can flip the analysis around and say that half of the entries in Table A vary from Table B. How do you determine which records are duplicates?
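The telephone number case can be made concrete with a small sketch. It assumes each row is a Python dictionary and that the telephone field is named `phone`; the field names, the sample values, and the labels are hypothetical. The code flags pairs that agree everywhere except for one digit of the telephone number. It does not, and cannot, say which number is correct.

```python
def digits(phone: str) -> str:
    """Strip punctuation and spaces so formatting differences are ignored."""
    return "".join(ch for ch in phone if ch.isdigit())


def digit_mismatches(a: str, b: str) -> int:
    """Count positions where two phone numbers disagree."""
    da, db = digits(a), digits(b)
    if len(da) != len(db):
        return max(len(da), len(db))   # treat a length mismatch as fully different
    return sum(x != y for x, y in zip(da, db))


def compare(row_a: dict, row_b: dict, phone_field: str = "phone") -> str:
    """Label a pair of rows from Table A and Table B."""
    if row_a.keys() != row_b.keys():
        return "different"
    for key in row_a:
        if key != phone_field and row_a[key] != row_b[key]:
            return "different"
    misses = digit_mismatches(row_a[phone_field], row_b[phone_field])
    if misses == 0:
        return "duplicate"
    if misses == 1:
        return "candidate: one digit differs"
    return "different"


table_a_row = {"card": "4111", "amount": "19.99", "phone": "(502) 555-0142"}
table_b_row = {"card": "4111", "amount": "19.99", "phone": "502-555-0143"}
print(compare(table_a_row, table_b_row))
```

Counting how many pairs land in the “candidate” bucket reproduces the spot check described above. Deciding which table holds the right number requires an outside source of truth, not more arithmetic.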
Microsoft SharePoint and the Law Firm
December 22, 2008
Lawyers are, in general, similar to Scrooge McDuck. If you are too young to remember the Donald Duck funny papers, Scrooge McDuck was tight with a penny. Lawyers eschew capital expenditures if possible. When a client foots the bill, the legal eagles will become slightly less abstemious, but in my experience, not too profligate with money.
Microsoft SharePoint offers an unbeatable combination for some law firms. Because the operating system is Microsoft’s, lawyers know that programmers, technical assistance, and even the junior college introductory computer class can be a source of expertise. And Microsoft includes a search system with SharePoint. Go with Microsoft and you get visions of lower initial costs, bundles, and a competitive market from which to select the technical expertise you need. What could be better? Well, maybe a big pharma outfit struggling with a government agency? Most attorneys would drool with anticipation to work for either the company or the US government. A new client is more exciting than software.
Several people sent me links to Mark Gerow’s article “Elements of a Successful SharePoint Search.” You can read the full text of his article at Law.com here. The article does a good job of walking through a SharePoint installation for a law firm. You will also find passing references to other vendors’ systems. The focus is SharePoint.
Could this be a metaphor for a SharePoint installation?
I found several points interesting. First, Mr. Gerow explains why search in a law firm is not like running a query on Microsoft’s Web search or any other Web indexing system. There is a reference to Google’s assertion that it has indexed one trillion Web pages and an accurate comment about the inadequacy of Federal government information in public search systems. I am not certain that attorneys will understand why Google has been able to land some law firms and a number of Federal agencies as customers with its search appliance. I know from experience that many professionals have a difficult time differentiating the content that’s available via the Web, content on the organization’s Web site, content on an Intranet, and content that may be available behind a firewall yet pulled from various sources. Also, I don’t think one can ignore the need for specialized systems to handle information obtained during the discovery process. Those systems do search, but law firms often pay hundreds of thousands of dollars because “traditional” search systems don’t do what attorneys need to do when preparing their documentation for litigation. These topics are referenced but not in a way that makes much sense for SharePoint, a singularly tricky collection of collaboration, content management, search, and Swiss Army Knife software packages presented as “one big thing”.
ChunkIt’s Evolution of Search
December 21, 2008
Happy quacks to the readers of this Web log for sending me links and snippets from “The Evolution of Search” by Admin here. I tried to answer the questions two people sent me about statements in this article. I wanted to offer some broad comments before these ideas get lost in the lumber room of my small goose brain. Keep in mind that this little goose brain of mine has concluded that Google has “won” the search wars. With an expanding market share in a down economy, its competitors have yet to demonstrate that they can kill the GOOG or leapfrog the beast. This, I hope, will be a controversial assertion and whip some of my readers into a crazed frenzy. I have learned that this type of open discussion does wonders for Web log traffic. Honk, honk.
ChunkIt is savvy enough to play the same game. So, the first point about “The Evolution of Search” is that the information is designed to promote ChunkIt, and there is nothing wrong with that. Second, the notion of evolution allows the author to create a narrative. In some places, the story line is a bit stretched, but in its broad outlines, a reader comes away from the article with an understanding of the unpredictability of search. Finally, the conclusion is also okay with me because the write up is clearly labeled as a ChunkIt effort. I don’t find anything wrong with tooting one’s own horn. Click here to buy a copy of Martin White’s and my new study Successful Enterprise Search Management. Monkey see, monkey do. That’s the story of the webby world, both digital and goose varieties I must say.
I want to comment on three points in the ChunkIt evolution article. I am not out to win friends and influence people, so stop reading if my penchant for looking at issues from a different perspective gives you a migraine. Cyrus, Martin, Bye Barry. Yes. I mean you.
First, the whole search revolution was a fluke. Search is actually old, much older than today’s 20-somethings think. There were corollaries for today’s neatest systems in the 1960s and 1970s. The systems sucked because of the limitations of hardware and programming tools. As the hardware became cheaper and more robust, programming tools hip hopped right along. Now, three decades later, whiz kids are dipping into their copy of Numerical Recipes and reinventing the past. So, the explosion of information, the shift to more users and a broader market, and the emergence of more capable, smarter software blundered forward in a two steps forward and one step back mode. By the mid 1990s, the avalanche had shifted from potential to real energy. I don’t do much history in my analyses of search because the same old stuff keeps getting recycled. History is a mass of cheap spaghetti, not a tidy box of pasta. ChunkIt falls into the trap of making a mess fit into a box. That does not work for me. You may find the approach useful. I don’t.
My view of the evolution of search. This image is from the Joe-KS Web site. What a wonderful illustration of the evolution of search technology. Source: http://www.joe-ks.com/archives_apr2006/EvolutionOfMan.jpg
SharePoint: ChooseChicago
December 18, 2008
I scanned the MSDN Web log postings and saw this headline: “SharePoint Web Sites in Government.” My first reaction was that the author Jamesbr had compiled a list of public facing Web sites running on Microsoft’s fascinating SharePoint content management, collaboration, search, and Swiss Army Knife software. No joy. Mr. Jamesbr pointed to another person’s list which was a trifle thin. You can check out this official WSS tally here. Don’t let the WSS fool you. The sites are SharePoint, and there are 432 of them as of December 16, 2008. I navigated to the featured site, ChooseChicago.com. My broadband connection was having a bad hair day. It took 10 seconds for the base page to render and I had to hit the escape key after 30 seconds to stop the page from trying to locate a missing resource. Sigh. Because this was a featured site that impressed Jamesbr, I did some exploring. First, I navigated to the ChooseChicago.com site and saw this on December 16, 2008:
The search box is located at the top right hand corner of the page and also at the bottom right hand corner. But the search system was a tad sluggish. After entering my query “Chinese”, the system cranked for 20 seconds before returning the results list:
K-Now: Here and Now
December 17, 2008
Guest Feature by Dawn Marie Yankeelov, AspectX.com
I have been discussing progress in semantic knowledge structures with Entrepreneur and Researcher Sam Chapman of K-Now who has recently left the University of Sheffield, Department of Computer Science, in the United Kingdom to go full-time into the delivery of semantic technologies in the enterprise. His attendance at the ISWC 2008 has created some momentum to engage new corporations in a discussion on a recently presented paper on “Creating and Using Organisational Semantic Webs in Large Networked Organisations” by Ravish Bhagdev, Ajay Chakravarthy, Sam Chapman, Fabio Ciravegna and Vita Lanfranchi. Knowledge management has shifted as evidenced in his paper. He contends with others that a more localized approach based on a particular perspective of the world in which one operates is far more useful than a centralized company view. All-encompassing ontologies are not the answer, according to Chapman. In the paper, his team indicates:
A challenge for the Semantic Web is to support the change in knowledge management mentioned above, by defining tools and techniques supporting: 1) definition of community-specific views of the world; 2) capture and acquisition of knowledge according to them; 3) integration of captured knowledge with the rest of the organisation’s knowledge; 4) sharing of knowledge across communities.
At K-Now, his team is focused upon supporting large scale organizations to do just this: capturing, managing and storing knowledge and its structures, as well as focusing upon how to reuse and query flexible dynamic knowledge. Repurposing trusted knowledge in K-Now is not based on fixed corporate structures and portal forms, but rather on capturing knowledge in user alterable forms at the point of its generation. Engineering forms, for example, that assist in monitoring aerospace engines during operations worldwide can be easily modified to suit differing local needs. Even with such modifications enabled, the system still captures integrated, structured knowledge suitable for spotting trends. Making quantitative queries without any pre-agreed central schemas is the objective. This is possible, under K-Now’s approach, due to the use of agreed semantic technology and standards.
Cloud Computing Challengers: Pundits Cheer for Their Clients
December 15, 2008
A happy quack to the reader who alerted me to Briefings Direct. The link pointed me to a transcript of a discussion among a group of analysts. I am fascinated by the prognostications of pundits. The deepening economic crisis and miserable track record of large information technology projects make me hungry for information about the future. These pundits are responsible for approaches to systems that some companies embrace. I could draw a connection between pundits and the present crisis in information technology, but I will not. Instead, let me capture the points that I noted as I worked my way through “BriefingsDirect Analysts Handicap Large IT Vendors on How Cloud Trend Impacts Them.” The set up is that cloud computing is a trend and the large information technology vendors are horses in a race. The participants in this discussion will, in theory, give the odds for selected vendors in the cloud computing horse race.
First, one of the pundits asserts that Microsoft “has the most to lose.” I believe that “lose” means revenues from on premises licensing of Office and products such as SQL Server. Okay, I understand that point and I see a grain of truth in the “most to lose” remark. One thought I have is that Microsoft is spending to make its software and services’ strategy viable. The spending coupled with erosion of on premises revenue ups the ante. Maybe the Briefings Direct session will focus on the economics of the Microsoft “to be” architecture.
Second, one of the panelists points to big companies like IBM and SAP who sell direct and who have established reseller channels. The idea is that these vendors are trying to maximize their sales impact. I am uncomfortable with the implication that big vendors have their act together. SAP, one of the companies with this sell direct and rely on an ecosystem approach, is in a tar pit. SAP’s missteps may be a glimpse of what will happen when more big companies with zero track record in offering software from data centers on a subscription basis try to make money. Again, the combination of capital investment and lack of experience may make these initiatives bleed red ink. IBM, for instance, tried Internet services and dumped the business to AT&T. IBM is a consulting company that also pushes software and hardware. There are too many assumptions about big companies succeeding in the cloud for me to be optimistic.
Third, the notion that cloud computing is a wide open race is intriguing. I think cloud computing, like telephony in the early 20th century, is one of those utility services. If a single company has an advantage, that company could make it difficult for competitors to get sufficient market traction to survive. Customers unwittingly act to make a monopoly come into existence. I see a hint of this in Google’s dominance of Web search and advertising. A company like Google or maybe Google itself could capture a dominant position. The Google Effect, then, is companies without Google’s technology advantage and customer base spending to catch up. Google keeps moving forward incrementally and retains its dominant position. I think that Google’s 70 percent share of Web search could be replicated in other cloud markets as well. The diversity dies out even if some vendors offer better solutions.
Who will win? IBM, Microsoft, Oracle, or SAP?
Fourth, the cloud approach means “flex sourcing”; that is, (I think) using what you need in the cloud and having some software on premises. My thought is that this is a commonsense approach. I don’t think “commonsense” applies to cloud services. The cost and complexity of on premises systems is probably a deal breaker for many organizations in today’s economic climate. The shift to cloud services may be forced upon companies. If a flash point exists, the cloud shift could be somewhat sudden, maybe like the emergence of the Internet as a utility. In that type of situation, the rules go out the window. Commonsense says, “That’s not likely to happen” as the shift occurs. These pundits are supporting the status quo and the status quo is crumbling as they opine.
Finally, one of the pundits suggests that a Microsoft – Yahoo tie up is good for Microsoft. This point caps a discussion of tie ups for cloud services; for example, Amazon and Oracle. Wow. I am not sure if Microsoft has the technical savvy to fix up the nicks and scratches in the Yahoo infrastructure at the same time it is struggling to build out and optimize its own data centers. I think the cheerleading for the big companies ignores the fundamental problem of getting from “as is” to “to be.” I am no Google fan goose, but I think the GOOG is going to stomp on some of these assertions relative to Microsoft and the other big companies mentioned in this discussion.
HP: Hardware, Software, or Ink
December 12, 2008
Hewlett Packard is an ink company to me and forever will be. HP ink works out to several thousand dollars a gallon. I calculated the amount once, but then I learned that the cartridges are often not filled to the brim. Without reliable cubic info, I stopped fiddling with numbers and adopted the “HP as an ink company” approach.
I recall getting a bulk mailing urging me to buy HP servers. I recall that HP asserted that it is one of the top two or three sellers of high end servers. I heard Google offer the same generalization. Then, in an IBM briefing I heard the same argument. I wondered, “What’s with Dell? Why is that company asserting it is one of the top three server vendors?”
The article “HP Strives for Recognition as Major Software Player” on the ZDNet.co.uk site here puzzled me. HP ships software with its printers that I find useful as examples of bloatware. I enjoy the messages to replace my ink cartridges, which reinforces the ink company notion. I recall struggling with various HP printer drivers, including a most remarkable PostScript driver that produced pages the size of stamps. We never figured out a work around. I just bought a Ricoh with PostScript that worked. I donated the HP printer to a charity and got a $10 tax deduction.
The point of the ZDNet.co.uk write up is this quote attributed to an HP executive:
In general, from an industry perspective, we think we’ve made huge progress but there’s still further room for growth in terms of brand and awareness.
I don’t agree. “In general” means that the assertion is a glittering generality. “Huge progress”, I don’t think so. For example, I have an HP laptop and it has quite a few weird software programs installed by default. I don’t know what they were, so I nuked them with Revo Uninstaller. Laptop seems okay but when I visited the HP Web site to look for updated drivers, I had absolutely no clue how to determine which software went with which notebook variant. I loaded one driver and everything worked until the reboot. Then the new driver blew away the USB and SD card functions. I rolled back the driver and forgot about the upgrade.
Yep, “huge progress.”
Information 2009: Challenges and Trends
December 4, 2008
Before I was once again sent back to Kentucky by President Bush’s appointees, I recall sitting in a meeting when an administration official said, “We don’t know what we don’t know.” When we think about search, content processing, assisted navigation, and text mining, that catchphrase rings true.
Successes
But we are learning how to deliver some notable successes. Let me begin by highlighting several.
Paginas Amarillas is the leading online business directory in Colombia. The company has built a new system using technology from a search and content processing company called Intelligenx. Similar success stories can be identified for Autonomy, Coveo, Exalead, and ISYS Search Software. Exalead has deployed a successful logistics information system which has made customers’ and employees’ information lives easier. According to my sources, the company’s chief financial officer is pleased as well because certain time consuming tasks have been accelerated, which reduces operating costs. Autonomy has enjoyed similar success at the US Department of Energy.
Newcomers such as Attivio and Perfect Search also have satisfied customers. Open source companies can also point to notable successes; for example, Lemur Consulting’s use of Flax for a popular UK home furnishing Web site. In Web search, how many of you use Google? I can conclude that most of you are reasonably satisfied with ad-supported Web search.
Progress Evident
These companies underscore the progress that has been made in search and content processing. But there are some significant challenges. Let me mention several which trouble me.
These range from legal inquiries into financial improprieties at Fast Search & Transfer, now part of Microsoft, to open Web squabbles about the financial stability of a Danish company which owns Mondosoft, Ontolica, and Speed of Mind. Other companies have shut their doors; for example, Alexa Web search, Delphes, and Lycos Europe. One vendor in Los Angeles has had to slash its staff to three employees and take steps to sell the firm’s intellectual property, which rightly concerns some of the company’s clients.
User Concerns
Another warning may be found in the results from surveys such as the one I conducted for a US government agency in 2007 that found dissatisfaction with existing search systems in the 65 percent range. AIIM, a US trade group, reported slightly lower levels of dissatisfaction. Jane McConnell’s recently released study in Paris reports data in line with my findings. We need to be mindful that user expectations are changing in two different ways.
First, most people today know how to search with Google and get useful information most of the time. The fact that Google is search for upwards of 65 percent of North American users and almost 75 percent of European Union users means that Google is the search system by which users measure other types of information access. Google’s influence has been essentially unchecked by meaningful competition for 10 years. In my Web log, I have invested some time in describing Microsoft’s cloud computing initiatives from 1999 to the present day.
For me and maybe many of you, Google has become an environmental factor, and it is disrupting, possibly warping, many information spaces, including search, content processing, data management, applications like word processing, mapping, and others.
Microsoft is working to counter Google, and its strategy is a combination of software and low adoption costs. I believe that Microsoft’s SharePoint has become the dominant content management, collaboration, and search platform with 100 million licenses in organizations. SharePoint, however, is not well understood as technically complex and a work in progress. Anyone who asserts that SharePoint is simple or easy is misrepresenting the system. Here’s a diagram from a Microsoft Certified Gold vendor in New Zealand. Simple this is not.
Search: Simplicity and Information Don’t Mix
December 1, 2008
In a conversation with a bright 30 something, I learned that a person insisted that the Google Search Appliance was “simple and easy”. I asked the person, “Did the speaker understand that information is inherently difficult so search is not usually simple?”
The 30 something did not hesitate. “Google makes the difficult look easy.”
The potential search system customer might hear the word “simple” and interpret the word and its intent based on the listener’s experience, knowledge, and context. “Simple”, like transparency, is a word that covers a multitude of meanings.
My concern is that search has to deliver information to a user with a need for fact, opinion, example, or data. None of these notions is known to the software, electrical devices, and network systems without considerable technical work. Computers are generally pretty predictable. Smart software improves the gizmo, but the smarter software becomes the less simple it is.
So, when a system like the Google Search Appliance or any search system for that matter is described as simple, I have questions. I don’t think the GSA is simple. The surface interface is simplified. The basic indexing engine is locked up and accessible via point and click interfaces or scripts that conform to the OneBox API. But anyone who has tried to cluster GSAs and integrate the system into proprietary file types knows that the word “simple” is pretty much wrong.
Now what about search becoming “simple and easy”?
Search looks simple because of the browser and the need to type some words in a search box or look at a list of links and click one. Search is not simple. I would go so far as to say that any system that purports to allow a user to access digital information is one of the most complex technical undertakings engineers, programmers, and other specialists have attempted.
That’s why search is generally annoying to most of the people who have to use the systems.
Now let’s consider the notion of a “transparent search system.” I have to tell you that I don’t know why the word “transparency” has become a code word for “not secret”. When someone tells me that a company is transparent, I don’t believe them. A company cannot be transparent. Most outfits have secrets, market with ferocity first and facts second, and wheel and deal to the best of their ability. None of this “information” becomes available unless there’s a legal matter, a security breach, or a very careless executive.
Are search systems transparent? Nope. Consider Autonomy, Google, or any of the high profile vendors of information access systems. Google does not allow licensees to poke around the guts of the GSA. Autonomy keeps the inner workings of IDOL under wraps. I have heard one Autonomy wizard say, “Sometimes we need to get Mike Lynch to work some of his famous magic to resolve an issue.” I track about 350 companies in the search and content processing space. I make my living trying to figure out how these systems work. Sue Feldman and I wrote a 10-page paper about one small innovation that interests Google. Nothing about that innovation was transparent, nor was it “simple” I might add.
What’s Up?
I think that consultants and parvenus need an angle on search, content processing, text mining, and information access. Since search is pretty complicated, who can blame a young person with zero expertise for looking at the shopping list of issues that are addressed in Successful Enterprise Search Management and deciding to go the “simple” route?
I understand this. I worked at a nuclear consulting firm for a number of years. I always thought I was pretty good in math, physics, and programming (if the type of programming done in 1971 could be considered sophisticated). Was I wrong? I was so wrong it took me one year to understand that I knew zero about the recent work in nuclear physics. By the end of the second year, I had a new appreciation for the role of Monte Carlo calculations in nuclear fuel rod placement. For example, you don’t inspect nuclear rods in an online reactor. You would have some health problems. So, you used math, and you needed to be confident that when you moved those bundles of nuclear fuel around, you got the used up ones where they were supposed to go. Forget the modest health problem. The issue would be a tad more severe.
Search shares some complexity with nuclear physics. The essence of search today is hugely complex subsystems that must perform so the overall system works. Okay, that applies to a nuclear reactor. You can’t really inspect what’s going on because there are too many data points. Yep, that’s similar to the need to know what’s happening in a reactor using math and models. A search system can exhibit issues that are tough to track down because no one human knows where a particular glitch may touch another function and cause a crash. Again, just like a nuclear reactor. Those control rooms you see in the films are complicated beasties for a reason. No one really knows what exactly is happening to cause an issue locally or remotely in the system.
Now who wants to say, “Nuclear engineering is simple”? I don’t see too many people stepping forward. In fact, I think that most people know enough to not offer an opinion when it comes to nuclear engineering and the other disciplines required to keep the local power generation plant benign.
I can’t say the same for search. Search is popular and it has attracted a lot of people who want to make money, be famous like a rock star, or who know one way to beat the financial down turn is to cook up an interesting spin on a hot topic. I congratulate these people, but I think the likelihood of creating trouble is going to be quite high.
I have learned in my 65 years one thing:
What looks simple isn’t.
Try and do what a professional does. You probably won’t be able to do it. Whether physical or intellectual, if you haven’t done the time, you can’t equal the professionals. Period.
At a conference, a speaker mentioned that for a person to become accomplished, the individual has to work at a particular skill or task for 10,000 hours. I know quite a few people who have spent 10,000 or more hours working on search. I wrote a book with one of these people, Martin White. I am a partner with another, Miles Kehoe. I know maybe 50 other people in the same select group. Most of the consultants and experts I meet are not experts in search. These people are expert at being friendly or selling. Those are great competencies, but they are not search related.
If you have read a few of my previous posts in this Web log, you know that any search or content processing system described as “simple” or “easy” is most definitely not either. Search is complicated. Marketing and sales “professionals” routinely go to meetings and say, “Search is simple. Our system is completely open. Your own technical team can maintain the system.” In most cases, I don’t believe the pitch.
That’s why the majority of users are annoyed with search in an organization. And why most of the search systems end up in quite a pickle. See the upside down and backwards engine in the picture below. How did this happen? I haven’t a clue, and that is how I react when I see a crazy search and information access system at an organization.
Let me give you an example. A large not for profit and government subsidized think tank had the following search systems: Microsoft SharePoint, Open Text, multiple Google Search Appliances, and a couple of legacy systems I had not encountered for a decade. Now the outfit wants to provide a single interface to the content processed by this grab bag of systems. What makes this tough is that one can use any of the systems to provide this access. The organization did not know how to do this and wanted to buy a new system to deliver the functionality. Crazy. What the outfit now has is another search system and the problem is just more complicated. The “real fix” required thinking about the needs of the users and performing the intensive information audit needed to determine the scale of the project. This type of “grunt work” was not desirable. The person describing this situation to me said, “We want a simple solution.”
I am sure they do. I want to be 18 again and this time I want to look like Brad Pitt, not some troll from the catacombs in Paris. Won’t happen.
How did we get our search system in this predicament?
Three Types of Simple Search
Let me give you three examples:
- Boil the ocean easy. Some vendors pitch a platform. The idea is that a licensee plugs in information connectors, the system processes the content, and the user gets answers. Guano. In fact, double guano. This approach is managerially, technically, and financially complex. Boil the ocean solutions are the core reason why such outfits as IBM, Microsoft, Oracle, and SAP give away search. By wrapping complexity inside of complexity, the fees just keep rolling in. The multi-month or multi-year deployment cycles guarantee that the staff responsible for this solution will have moved on. Search in most boil the ocean solutions only works for some of the users.
- Buy ’em all. Use Web services to hook ’em up easy. Quite a few vendors take this approach. The verbal acrobatics of “federated search” or “metasearch” gloss over the very real problems of acquiring disparate content without choking the network, spending a fortune on a repository infrastructure, and transforming the content to a representation. These problems are happily ignored or marginalized. Unfortunately these federated solutions require investment, planning, and building. I wish I had a dollar for every time I have heard a vendor struggling to make significant sales say the words “federated” and “easy” in the same sentence.
- Unpack it, plug it in, and just search easy. This argument is now coming from vendors who ship search appliances and from vendors who ship software described as appliances. Hello, earth. Are you sentient? Plugging in an appliance delivers one thing: toast. These gizmos have to be set up. You have to plan for failure, which means two gizmos and maybe clusters of gizmos. In case you haven’t tried to create hot spares and failover search systems, the work is not easy. And you haven’t tackled the problem of acquiring, transforming, and processing the content. You haven’t fiddled with the interface that marketing absolutely has to have or the MBAs throw a hissy fit. Get real. When a modern appliance breaks, you don’t fix it. You buy another one. You don’t open a black box iPod or BlackBerry and repair it. You get a new one. The same applies to search. What’s “easy” is the action you take when the system doesn’t work.
Ad Networks
November 30, 2008
The Overflight technology here sparked some inquiries from companies in the ad network business. I never pay much attention to online advertising. My view is that if a Web site offers content, the Web indexing systems will find it. A good example is the Overflight service announced on November 17, 2008. On November 18, the site was not in the Google index. By November 20, 2008, the Overflight site ranked sixth in the results list for the word Overflight. As I write this before getting on a flight to Europe, the Overflight site’s ranking in the results list for the query “overflight” is number two. No metatag spamming, no SEO baloney, no nothing. We index content and provide what seems to me to be a useful service. We are now adding some other features to the public-facing Web site. The most interesting will be the use of the Exalead CloudView technology. This is a joint effort between my technically challenged goslings and the French wizards at Exalead. Watch for the announcement shortly. The service is in final testing and looks quite good so far.
But the ad network calls to me put me in unfamiliar territory. I have researched Google’s AdSense, which makes use of the Oingo (Applied Semantics) technology plus many Google inventions, enhancements, and tweaks. My focus on AdSense and its sister AdWords created for me a volcanic island of information. I thought that Google *was* online advertising.
The yellow box marks the Overflight result on November 29, 2008.
After a bit of research, I learned that Google is not the only game in town. Sure, there are the Microsoft and Yahoo services that I know by name. A bit of sleuthing turned up a large number of outfits who are in the business of selling ads to companies wanting to reach online users. One of them is the AutoChannel.com, a company with which I have been associated for years. Because of the volume of traffic the Auto Channel gets, I saw its name as a place where companies wanting to reach auto enthusiasts could advertise. You can learn more about this directly from the company. Just navigate here for the media kit.
I located on the Web logs at ZDNet a useful list here of what the company calls “Top 50 US Ad Carriers in October 2008.” The usual suspects appear on this list, but there were many firms whose names I did not recognize. I clicked on about a dozen of the top ranked firms and learned that each provides a wide range of services both to high-traffic Web sites looking to generate revenue and to advertisers who want to place messages on sites germane to their core markets. I can’t reproduce this list, but I think I can give you a flavor of the diversity of firms in this sector. Here are three companies I found interesting, but your taste is likely to be different from this goose’s: