Mysteries of Online 8: Duplicates
February 24, 2009
In print, duplicates are the province of scholars and obsessives. In the good old days, I would sit in a library with two books. I would then look at the data in one book and then hunt through the other book until I located the same or similar information. Then I would examine each entry to see if I could find differences. Once I located a major difference such as a number, a quotation, or an argument of some type, I would write down that information on a 5×8 note card. I had a forensics scholarship along with some other cash for guessing accurately on objective tests. To get the forensics grant, I had to participate in cross examination debate, extemporaneous speaking, and just about any other crazy Saturday time waster my “coaches” demanded.
Not surprisingly, mistakes or variances in books, journals, and scholarly publications were not of much concern to some of the students who attended the party school that accepted an addled goose with thick glasses. There were rewards for spending hours looking for information and then chasing down variances. I recall that our debate team, which was reasonably good if you liked goose arguments, was putting up with a team from Dartmouth College. I was listening when I heard a statement that did not match what I had located in a government reference document and in another source. The opponent from Dartmouth had erroneously presented the information. I gave a short rebuttal. I still remember the look of nausea that crossed our opponent’s face when she realized that I had presented what I found in my hours of manual checking and reminded the judges that distorting information suggests an issue with the argument. We won.
For most people, the notion of having two individuals with the same source is an example of duplicate information. Upon closer inspection, duplication does not mean identical in gross features. Duplication drills down to the details of the information and to the need to determine which item of information is at variance, figure out why, and decide which is the most likely version of the duplicate.
That’s when the fun begins in traditional research. An addled goose can do this type of analysis. Brains are less important than persistence and a tolerance for some dull, tedious work. As a result, finding duplicative information and then figuring out variances was not something that the typical college sophomore spent much time doing.
Enter computer systems.
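The manual routine described above (find the matching entry in a second source, then flag where the two differ) can be sketched in a few lines of Python. This is only an illustration, not a method from the original research: the function names `similarity` and `find_variances`, the sample sentences, and the 0.6 similarity threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity score, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_variances(source_a, source_b, threshold=0.6):
    """Pair each entry in source_a with its closest match in source_b.

    Report only pairs that are similar enough to be "the same" item
    (score >= threshold, an assumed cutoff) but not identical (score < 1.0).
    """
    variances = []
    for entry_a in source_a:
        # Closest candidate in the second source; None if it is empty.
        best = max(source_b, key=lambda entry_b: similarity(entry_a, entry_b), default=None)
        if best is None:
            continue
        score = similarity(entry_a, best)
        if threshold <= score < 1.0:
            variances.append((entry_a, best, round(score, 2)))
    return variances


if __name__ == "__main__":
    # Hypothetical entries standing in for two reference books.
    book_one = ["Exports rose 12 percent in 1967", "The treaty was signed in Paris"]
    book_two = ["The treaty was signed in Paris", "Exports rose 21 percent in 1967"]
    for original, variant, score in find_variances(book_one, book_two):
        print(f"{original!r} vs {variant!r} (similarity {score})")
```

The sketch only surfaces candidate variances. Deciding which version is correct, and why the sources differ, still takes the persistence described above.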
Deep Web, Surface Sparkles Occlude Deeper Look
February 23, 2009
You can read pundits, mavens, and wizards comment on the New York Times’s “Exploring a Deep Web that Google Can’t Grasp.” The original is here for a short time. Analysis of varying degrees of usefulness appears in Search Engine Land and the Marketing Pilgrim’s “Discovering the Rest of the Internet Iceberg” here.
There’s not much I can say to reverse the flow of misinformation about what Google is doing because Google doesn’t talk to me or to the pundits, mavens, and wizards who explain the company’s alleged weaknesses. In 2007, I wrote a monograph about Google’s programmable search engine disclosures. Published by BearStearns, this document is no longer available. I included the dataspace research in my Beyond Search study for The Gilbane Group in April 2008. In September, Sue Feldman and I wrote about Google’s dataspace technology. You can get a copy of the dataspace report directly from IDC here. Ask for document 213562. Both of these studies explicate Google’s activities in structured data and how those data mesh with Google’s unstructured information methods. I did a detailed explanation of the programmable search engine inventions in Google Version 2.0. That report is still available, but it costs money and I will be darned if I will restate information that is in a for-fee study. There are some brief references to these technologies available at ArnoldIT.com without charge and in the archive to this Web log. You can search the ArnoldIT.com archive at www.arnoldit.com/sitemap.html and this Web log from the search box on any blog page.
This sure looks like “deep Web” information to me. But I am not a maven, wizard, or pundit. Nor do I understand search with the depth of the New York Times, search engine optimization experts, and trophy generation analysts. I read patent documents, an activity that clearly disqualifies me from asserting that Google can’t perform a certain action based on its open source disclosures. Life is easier when such disclosures are ignored or excluded from the research process.
So what? Two points:
- Google can and does handle structured data. Examples exist in the wild at base.google.com and by entering the query “lga sfo” from Google.com’s search box.
- Yip yap about the “deep Web” has been popular for a while, and it is an issue that requires more analysis than assertions based on relatively modest research into the subject.
In my opinion, before asserting that Google is baffled, off track, clueless, or slow on the trigger, look a bit deeper than the surface sheen on Googzilla’s scales. No wonder outfits are surprised by some of Google’s “effortless” initiatives. Those who deal in superficiality do not see the substance that resides under the surface.
Pundits, mavens, wizards, please, take a moment to look into Guha, Halevy, and the other Googlers who have thought about and who are working on structured, semistructured, and unstructured data in the Google data environment. That background will provide some context for Google’s apparent sluggishness in this “space”.
Stephen Arnold, February 23, 2009
Microsoft: Job Search
February 23, 2009
You may be able to find a job if you are not an addled goose like me. To use the service, click here and search away. You will need to install Silverlight, Microsoft’s Adobe Flash killer, and you will have a number of opportunities to learn skills that make it easy for you to land a job as a Microsoft SharePoint developer or a Microsoft FAST ESP engineer. If you take a job at Microsoft and then lose it, Microsoft may want you to repay some of your severance buy out. You can read about this administrative dorkiness here. Somehow in my addled goose brain drifting in the mine run off slurry this afternoon, the “elevate” and “give us back your severance” fail to inspire confidence that the company has packed its Winnebago for a trip to Tomorrowland.
Stephen Arnold, February 23, 2009
Exclusive Interview, Martin Baumgartel, From Library Automation to Search
February 23, 2009
For many years, Martin Baumgartel worked for a unit of T-Mobile. His experience spans traditional information retrieval and next-generation search. Stephen Arnold and Harry Collier interviewed Mr. Baumgartel on February 20, 2009. Mr. Baumgartel is one of the featured speakers at the premier search conference this spring, where you will be able to hear his lecture and meet with him in the networking and post presentation breaks. The Boston Search Engine Meeting attracts the world’s brightest minds and most influential companies to an “all content” program. You can learn more about the conference, the tutorials, and the speakers at the Infonortics Ltd. Web site. Unlike other conferences, the Boston Search Engine Meeting limits attendance in order to facilitate conversations and networking. Register early for this year’s conference.
What’s your background in search?
When I entered the search arena in the 1990s, I originated from library automation. Back then, it was all about indexing algorithms and relevance ranking where I did research to develop a search engine. During eight years at T-Systems, we analyzed the situation in large enterprises in order to provide the right search solution. This included, increasingly, the integration of semantic technologies. Given the present hype about semantic technologies, it has been a focus in current projects to determine which approach or product can deliver in specific search scenarios. A related problem is to identify underlying principles of user-interface-innovations to know what’s going to work (and what’s not).
What are the three major challenges you see in search / content processing in 2009?
Let me come at this in a non technical way. Although there are plenty of challenges awaiting algorithmic solutions, I see more important challenges here:
- Identifying the real objectives, fighting myths. For an organization, implementing internal search today hasn’t become any easier. There are numerous internal stakeholders, paired with a very high user expectation (they want the same quality as with Internet search, only better, more tailored to their work situation and without advertising…). Keeping a sharp analysis becomes difficult in an orchestra of opinions, in particular when familiar brand names get involved (“Let’s just take Google internally, that will do.”)
- Avoid simplicity. Although many CIOs claim they have “cleaned up” their intranets, enterprise search remains complex, both technologically and in terms of successful management. Therefore, tackling the problem with a self-proclaimed simple solution (plug in, ready, go) will provide Search, but perhaps not the search solution needed, and with hidden costs, especially in the long run. At the other extreme, a design too complex – with the purchase of dozens of connectors – is likely to burst your budget.
- Attention. Recently, I heard a lot about how the financial crisis will affect search. In my view, the effects are only reinforcing the challenge “How to draw enough management attention to Search to make sure it’s treated like other core assets”. Some customers might slow down the purchase of some SAP add-on modules or postpone a migration to the next version of Backup Software. But the status of those solutions among CIOs will remain high and unquestioned.
With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?
There’s no unique definition of the “Enterprise Search Problem” as if it were a math theorem. Therefore, you find somewhat amorphous definitions of what is to be solved. Let’s take the scope of content to be searched: everything internal? And nothing external? Another obstacle is the widespread belief in shortcuts. Popular example: Let’s just index the content present in our internal content management system, the other content sources are irrelevant. That way, the concept of completeness in search/result set is sacrificed. But search can be as gruesome as the Marathon: you need endurance and there are no shortcuts. If you take a shortcut, you’ve failed.
What is your approach to problem solving in search and content processing?
Smarter software definitely, because the challenges in search (and there are more than three) are attracting programmers and innovators to come up with new solutions. But, in general, my approach is “keep your cool”. Assess the situation, analyze tools and environment, design the solution and explain it clearly. In the process, interfaces have to be improved sometimes in order to trim them down to fit with the corporate intranet design.
With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?
We’ll see how far a consolidation process will go. Perhaps we’ll see discontinued search products where we initially didn’t expect it. Also, the relation asked about in the following question might be affected: software companies are unlikely to cut back on core features of their product. But integrated search functions are perhaps identified for the scalpel.
Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?
I’ve seen it the other way around: Customer Support Managers told me (the Search person) that the built-in search-tool is ok but that they would like to look up additional information from some other internal applications. I don’t believe that built-in search will replace stand-alone search. The term “built-in” tells you that the main purpose of the application is something else. No surprise that, for instance, the user interface was designed for this main purpose – and will, in conclusion, not address typical needs of search.
Google has disrupted certain enterprise search markets with its appliance solution. What can a vendor do to adapt to this Google effect?
A vendor should point out where he differs from Google and why to address this Google-effect.
But I see Google as a significant player in enterprise search, if only for the mindset of procurement teams you describe in your question.
As you look forward, what are some new features / issues that you think will become more important in 2009?
The issue of cloudsourcing will gain traction. As a consequence, not only small and medium sized enterprises will discover that they might not invest in in-house Content Management and Collaboration applications, but use a hosted service instead. This is when you need more than a “behind the firewall” search, because content will be scattered across multiple clouds (CRM cloud, Office cloud). I’m not sure whether we will see a breakthrough there in 36 months; but the sooner the better.
Where can I find more information about your services and research?
http://www.linkedin.com/in/mbaumgartel
Stephen E. Arnold, www.arnoldit.com/sitemap.html and Harry Collier, www.infonortics.com
Autonomy Encomium
February 23, 2009
If you love Autonomy, you will delight in “Autonomy Continues the Path to eDiscovery with Conceptual Search.” The story appeared in CMSWire here. The write up follows a familiar and entertaining path. The news was so good that Morningstar documented here that Autonomy’s CFO Sushovan Hussain snapped up 85,000 shares of Autonomy stock in February 2009. For this deal, Mr. Hussain sold some shares and turned around and bought more. Life seems to be good for Autonomy. As its competitors paddle harder, Autonomy sails on the winds of success.
Stephen Arnold, February 23, 2009
eZ Find: SOLR and More for Search
February 23, 2009
There’s another updated open source search product on the market. eZ, http://ez.no/, has tuned up eZ Find, http://ez.no/ezfind, a search extension for eZ Publish, http://ez.no/ezpublish, its enterprise web content management system. The extension is free to download and install on eZ Publish sites at http://ez.no/ezfind/download. eZ Publish comes out of the box prepped to help you get your content up online ASAP: publishing through either browsers or word processors, translations, multiple file uploads, picture and video galleries, and search. eZ Find is just one piece of the puzzle, and with it you can now fine-tune relevance ratings, use faceted searches, and take advantage of boolean, fuzzy, and wildcard operators, among other features. eZ Publish license information is at http://ez.no/software/proprietary_license_options, and it looks like it has lots of happy customers, listed at http://ez.no/customers. Stay tuned for more updates.
Jessica W. Bratcher, February 23, 2009
NSA Oral Histories Available
February 23, 2009
If you are looking for a test corpus against which to benchmark a search system, take a look at the National Security Agency’s declassified oral interviews. A happy quack to the reader who alerted me to BeSpecific’s write up “Declassified Oral History Interviews Posted by National Security Agency” here. Grab ’em quick. The NSA, according to the write up, has reworked its Web site. I enjoyed “Doing Business with the NSA.” Interesting, if not exactly in line with how the world in Beltway Bandit land often works. For more NSA content, run this query on Uncle Sam.
Stephen Arnold, February 23, 2009
IBM OmniFind Made Simple
February 23, 2009
If you are enamored of IBM software, you will thrill to Kent Milligan’s lucid write up about IBM OmniFind, the free edition. Mr. Milligan touches on the new semantic features in OmniFind, provides some set up tips, and points to links with more information for those with an OmniFind appetite. The write up “DB2 for i DBA: OmniFind Text Search Server” is here. In addition to the code snippets, the most interesting chunk of the write up for me was:
This new product provides the ability to quickly search text with advanced linguistic methods. Even more exciting is the fact that these text-search capabilities are not just limited to simple text strings stored in databases; they can also be applied to text stored in document formats such as Adobe PDF or Microsoft Word.
Hard charging OmniFinders will know most of the info in the article. For those wanting an overview, Mr. Milligan saves us some ramp up time.
Stephen Arnold, February 23, 2009
The Frustration Machine: Yahoo Shopping Search
February 22, 2009
Navigate to Yahoo. Click on Shopping. Enter this query, “discount carpet tiles”. Scan the results list. No carpet tiles. The list contains cleaner, door mats, and a rotary jet extractor. The phrase “carpet tiles” generates two pages of results. Zero relevance. The ads are a different story. On the first page of results, Yahoo displays nine ads with the phrase “carpet tiles” and one ad inviting me to take a survey and maybe win $1,000. With news of a Yahoo reorganization flapping across my monitor, I sure hope the new set up tackles the fundamental ineffectiveness of eCommerce search. Here’s Bloomberg’s description of the reorganization. I heard Bainies are helping Yahoo. I wonder if the MBA trophy generation wizards have run the “discount carpet tiles” query on Google. Give it a whirl here. Nah, that’s too trivial an exercise for Yahoo poobahs, search wizards, and consultants.
Stephen Arnold, February 22, 2009
Number 13 in the Biggest Technology Goof List
February 22, 2009
ComputerWorld published here on February 22, 2008, a list of “The 25 Greatest Blunders in Tech History.” I find these lists amusing. I paddled right by the first 12 and the last 12. I focused on blunder number 13:
Search portals. Where are they now? At the height of the dot-com boom, web surfers had a plethora of search engines to choose from: AltaVista, Excite, InfoSeek, Lycos, and many more. Today, the major players of the past are mostly dead. A few have soldiered on, such as Ask.com, but only after repeated redesigns. Chalk it up to old-fashioned hubris. Instead of concentrating on their search offerings, the first-generation search engines fell victim to the portal arms race. They built up dashboards full of sports scores, stock quotes, news headlines, horoscopes, the weather, email, instant messaging, games, and sponsored content – until finding what you wanted was like playing Where’s Waldo. Neither fish nor fowl, they became awkward combinations of search portals and general-interest portals. The world went to Yahoo for the latter. And when an upstart called Google appeared with a clean UI and high-quality search, users told the other engines to get lost.
The consequence of the portal mania. Our pal Googzilla. The failure of portals opened the door to my favorite example of received wisdom (portals are the future) creating the space for a hyperconstruct to reshape online, search, and a number of other businesses. I would have moved this goof to the top 10. But 13 remains an unlucky number for the companies who jumped on the portal bandwagon a decade ago.
Stephen Arnold, February 22, 2009