QuickAnswers Currently Limited but Possibly Promising
August 7, 2014
Sphere Engineering is looking to reinvent the way Web information is organized with QuickAnswers.io. This search engine returns succinct answers to questions instead of results lists. More a narrowed Wolfram|Alpha than a Google. At least that’s the idea. So far, though, it’s a great place to ask a question—as long as it’s a question to which the system knows the answer. I tried a few queries and got back almost as many “sorry, I don’t know”s or nonsense responses. For now, at least, the page admits that “the current state of this project only reflects a tiny fraction of what is possible.” Still, it may be worth checking back in as the system progresses.
The company’s blog post about the project lets us in on the vision of what QuickAnswers could become. Software engineer François Chollet writes:
“I recently completed a total rewrite of QuickAnswers.io, based on a new algorithm. I call it ‘shallow QA’, as opposed to IBM Watson’s ‘deep QA’. IBM Watson keeps a large knowledge model available for queries and thus requires a supercomputer to run. At the other end of the spectrum, QuickAnswers.io generates partial knowledge models on the fly and can run on a micro-instance.
“QuickAnswers.io is a semantic question answering engine, capable of providing quick answer snippets to any question that can be answered with knowledge found on the web. It’s like a specialized, quicker version of a search engine. You can see a quick overview of the previous version here.”
The description then gets technical. Chollet uses several examples to illustrate the algorithm’s approach, the results, and some of the challenges he’s faced. He also explains his ambitious long-range vision:
“In the longer term, I’d like to read the entirety of the web and build a complete semantic Bayesian map matching a maximum of knowledge items. Also, it would be nice to have access to a visualization tool for the different answers available and their frequency across sectors of opinion, thus solving the problem of subjectivity.”
These are some good ideas, but of course implementation is the tough part. We should keep an eye on these folks to see whether those ideas make it to fruition. While pursuing such visionary projects, Sphere Engineering earns its dough by building custom machine-learning and data-mining solutions.
Cynthia Murrell, August 07, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Free Intranet Search System
August 7, 2014
Anyone on the lookout for a free intranet search system? FreewareFiles offers Arch Search Engine 1.7, also known as CSIRO Arch. The software will eat up 22.28MB, and works on both 32-bit and 64-bit systems running Windows 2000 through Windows 7 or MacOS or MacOS X. Here’s part of the product description:
Arch is an open source extension of Apache Nutch (a popular, highly scalable general purpose search engine) for intranet search. Not happy with your corporate search engine? No surprise, very few people are. Arch (finally!) solves this problem. Don’t believe it? Try Arch, blind test evaluation tools are included.
In addition to excellent search quality, Arch has many features critical for corporate environments, such as document level security.
Features:
*Excellent search quality: Arch has solved the problem of providing good search results for corporate web sites and intranets!
*Up to date information: Arch is very efficient at updating indexes and this ensures that the search results are up to date and relevant. Unlike most search engines, no complete ‘recrawls’ are done. The indexes can be updated daily, with new pages discovered automatically.
*Multiple web sites: Arch supports easy dynamic inclusion or removal of websites.
They also say the system is easy to install and maintain; uses two indexes so there’s always a working one; and is customizable with either Java or PHP.
Cynthia Murrell, August 07, 2014
Tackling a Small SharePoint Cleanup
August 7, 2014
SharePoint cleanup is never fun, regardless of the size of the organization. However, there are ways to make it smoother than expected, and to take a bit of the pain out of the process. CMS Wire gives some advice on this process for small organizations in their article, “One Consultant’s Approach to a Small SharePoint Cleanup.”
The article begins:
“A pilot SharePoint cleanup project is straightforward. A consultant facilitating a small project to clean up a company’s SharePoint intranet can reach the lessons learned phase with a few basic tools. Recall the mantra: simple is elegant. You will require these basic tools: a project proposal, a workbook, a decision tree, a summary report.”
Stephen E. Arnold is also a helpful resource when it comes to SharePoint advice, tips, and tricks. He has made a career out of covering all things search, including SharePoint, and reporting on them via his Web service, ArnoldIT.com. His SharePoint feed is particularly helpful, and SharePoint users and managers for organizations of any size will find it useful. Keep an eye out for any tricks that might be helpful for your organization the next time you are called upon to update or clean up your SharePoint implementation.
Emily Rae Aldridge, August 07, 2014
Connotate: Marketing by Listing Features
August 6, 2014
Connotate posted a page that lists 51 features. The title of the Web page is “What Connotate Does Better than Scripts, Scrapers, and Toolkits.” The 51 features are grouped into 10 categories. Several are standard content processing operations; for example, scaling, ease of use, and rapid deployment.
Several are somewhat fuzzy. A good example is the category “Efficiency”. Connotate explains this concept with these features:
- Highly efficient code is automatically generated during Agent training
- Agents bookmark the final destination and identify links that aren’t necessary, bypassing useless links and arriving at the desired data much faster
- Optimized navigation also generates less traffic on target websites
- Supports load balancing
- Multi-threaded – supports simultaneous execution of multiple Agents on a single system
- Optimizes resource usage by analyzing clues during runtime about the various intended uses of the extracted data
From my experience with training systems, I know that the process can be quite a job, particularly when the source content is not scientific, technical, and medical information. STM is somewhat easier because the terminology is less colorful than social media content, for example. The deployment of agents that do not trigger a block by a target is a good idea. But load balancing is a different type of efficiency and one that is becoming part of some vendors’ punch lists.
I found the 51 items useful as a thought starter for crafting a listicle.
Stephen E Arnold, August 6, 2014
Data Augmentation: Is a Step Missing or Mislocated?
August 6, 2014
I read “Data Warehouse Augmentation, Part 4.” You can find the write up at http://ibm.co/1obWXDh. There are other sections of the write up, but I want to focus on the diagrams in this fourth chapter/section.
IBM is working overtime to generate additional revenues. Some of the ideas are surprising; for example, positioning Vivisimo’s metasearch function as a Big Data solution or buying Cybertap and then making the quite valuable technology impossible to find unless one is an intelligence procurement official. Then there is Watson, and I am just not up to commenting on this natural language processing system.
To the matter at hand. There is basic information in this write up about specific technical components of a Big Data solution. The words, for the most part, will not surprise anyone who has looked at marketing collateral from any of the Big Data vendors/integrators.
What is fascinating about the write up is the wealth of diagrams in the document. I worked through the text and the diagrams and I noticed that one task is not identified as important; specifically, the conversion of source content into a file type or form that the content processing system can process.
Here’s an example. First the IBM diagram:
Source: IBM, Data Warehouse Augmentation, 2014.
Notice that after “staging”, there is a function described in time-honored database speak, “ETL.” Now “extract, transform, and load” is a very important step. But is there a step that precedes ETL?
How can one extract from disparate content if a connector is not available or the source system cannot support file transfers, direct access, or reports that reflect in memory data?
In my experience, there will be different methods of acquiring content to process. There are internal systems. If there is an ancient AS/400, some work is required to generate outputs that provide the data required. Because of the nature of the AS/400 and its outstanding memory system, direct interaction requires some care to get the data and the updates not yet written to disc without corrupting the in memory information. We have addressed this “memory fragility” by using a standalone machine that accepts an output from the AS/400 and then disconnects. The indexing system, then, connects to the standalone machine to pick up the AS/400 outputs. Clunky? You bet. But there are some upsides. To learn about the excitement of direct interaction with AS/400, just do some real time data acquisition. Let me know how this works out for you.
The same type of care is often needed with the content assembled for the data warehouse pipeline. Let me illustrate this. Assume the data warehouse will obtain data from these sources: internal legacy systems, third party providers, custom crawls with the content residing on a hosted service, and direct data acquisition from mobile devices that feed information into a collection point parked at Amazon.
Now each of these content streams has different feathers in its war bonnet. Some of the data will be well formed XML. Some will be JSON. Some will be a proprietary format unique to the source. For each file type, there will be examples of content objects that are different, due to a vendor format change or a glitch.
These disparate content objects, therefore, have to be processed before extraction can occur. So has IBM put ETL in the wrong place in this diagram or has IBM omitted the pre-processing (normalization) operation?
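To make the normalization step concrete, here is a minimal sketch in Python. It is not IBM's method or ours; it simply assumes a stream of (source, raw text) pairs and shows one way to turn well formed XML and JSON into a common record before any ETL runs. The names normalize, preprocess, and normalization_rate are hypothetical, and anything that is not XML or JSON (a proprietary format, a glitch) lands in an exceptions list instead of disappearing.

```python
import json
import xml.etree.ElementTree as ET


def normalize(raw: str, source: str) -> dict:
    """Convert one raw content object into a common record, or raise ValueError."""
    text = raw.strip()
    if text.startswith("{"):
        try:
            body = json.loads(text)
        except json.JSONDecodeError as err:
            raise ValueError(f"bad JSON: {err}") from err
        fmt = "json"
    elif text.startswith("<"):
        try:
            root = ET.fromstring(text)
        except ET.ParseError as err:
            raise ValueError(f"bad XML: {err}") from err
        # Flatten only the top-level child elements; real feeds need a schema.
        body = {child.tag: (child.text or "").strip() for child in root}
        fmt = "xml"
    else:
        # Proprietary or damaged formats are assumed to need their own handler.
        raise ValueError("unrecognized format")
    return {"source": source, "format": fmt, "body": body}


def preprocess(stream):
    """Split (source, raw) pairs into normalized records and logged exceptions."""
    records, exceptions = [], []
    for source, raw in stream:
        try:
            records.append(normalize(raw, source))
        except ValueError as err:
            # Always record the failure so skipped content is visible later.
            exceptions.append({"source": source, "reason": str(err), "raw": raw})
    return records, exceptions


def normalization_rate(records, exceptions) -> float:
    """Share of content objects that made it through pre-processing."""
    total = len(records) + len(exceptions)
    return len(records) / total if total else 1.0
```

Only the records that survive this step are candidates for extract, transform, and load. The rate gives a quick check on how much content never reached the system.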
In our experience, content that cannot be processed is not available to the system. If big chunks of content end up in the exceptions folder, the resulting content processing may be flawed. One of the data points that must be checked is the number of content objects that can be normalized in a pre processing stream. We have encountered situations like these. Your mileage may vary:
- Entire streams of certain types of content are exceptions, so the resulting indexing does not contain the data. Example: outputs from certain intercept systems.
- Streams of content skip non processable content without writing exceptions to a file due to configuration or resource availability.
- Streams of content are automatically “capped” when the processing system cannot keep pace. When the system accepts more content, it does not pull information from a cache or storage pool. The system just ignores the information it was unable to process.
There are fixes for each of these situations. What we have learned is that this pre processing function can be very expensive, have an impact on the reliability of the outputs from the data warehousing system when queried, and generate a bottleneck that affects downstream processes.
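One possible fix for the silent skip and the “capped” stream can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any vendor's system: the Intake class, its capacity setting, and the in-memory backlog (a stand-in for disk or a cache) are all hypothetical. The point is that overflow is parked rather than ignored and that every failure is written down.

```python
from collections import deque


class Intake:
    """Bounded intake that parks overflow in a backlog instead of dropping it."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()      # objects ready for processing
        self.backlog = deque()    # overflow kept for later (stand-in for disk or a cache)
        self.exceptions = []      # objects that could not be processed, with reasons
        self.processed = 0

    def accept(self, obj) -> None:
        # When the queue is full, keep the object rather than ignoring it.
        if len(self.queue) < self.capacity:
            self.queue.append(obj)
        else:
            self.backlog.append(obj)

    def drain(self, handler) -> None:
        """Process everything queued, refilling from the backlog as room opens."""
        while self.queue:
            obj = self.queue.popleft()
            try:
                handler(obj)
                self.processed += 1
            except Exception as err:
                # Write every failure down; silent skips hide missing data.
                self.exceptions.append((obj, str(err)))
            if self.backlog:
                self.queue.append(self.backlog.popleft())

    def report(self) -> dict:
        """Counts to check before trusting the downstream index."""
        return {
            "processed": self.processed,
            "exceptions": len(self.exceptions),
            "waiting": len(self.queue) + len(self.backlog),
        }
```

Comparing the report against the number of objects accepted shows whether anything went missing between acquisition and indexing.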
After decades of data warehousing refinement, why does this problem keep surfacing?
The answer is that recycling traditional thinking about content processing is much easier than figuring out what causes a complex system to derail itself. I think that may be part of the reason the IBM diagram may be misleading.
Pre-processing can be time consuming, hungry for machine resources, and very expensive to implement.
Stephen E Arnold, August 6, 2014
Fortune, Google, and the Seven Deadly Sins
August 6, 2014
I read a darned amazing article at Fortune.com. The story is “The Seven Deadly Sins of Googling.” The article is not about Google. The article is about the humans who use Google.
What I find interesting is that Fortune has reached into the world of cardinal sins. Instead of the ethics embraced by folks, Fortune hooks SALIGIA to using an ad supported online service.
“I don’t have much time. Please, don’t confuse me with facts,” says the modern MBA. Image source: http://gargoyle-statues.hubpages.com/hub/3-Types-of-Gargoyle-Statues-For-Your-Garden
I find the linkage fascinating because it illustrates the type of analysis that seems to be sophisticated with the so called search expertise of Fortune readers, executives, and writers.
I liked the envy section. The article states:
Envy: When you’re jealous of someone else’s Google results. Social media can lead to envy. It can lead, possibly, to depression. In a 2013 study, University of Michigan researchers Ethan Kross and Philippe Verduyn texted people while they were using Facebook, and found that as time on Facebook increased, a person’s mood and overall satisfaction with their lives declined. In other words, Facebook can make you jealous. It can make you feel more alone than connected. Kross and Verduyn didn’t look at other social media networks, but it’s fair to say that looking through lists of other people’s accolades, impressive resumes, and social media clout can just as easily turn you green around the ears.
I found this amusing, although I am not certain that Fortune intended the write up to be funny, even Onionesque.
The meshing of the Seven Deadly Sins with lousy research skills is an example of faux intellectualism. Another recent example is an IDC report that uses the phrase “knowledge quotient” in its title. The reference to cardinal sins sounds good and seems to make sense. “Knowledge quotient” seems to make sense until one looks at how the phrase was used 40 years ago. Then the jargon is almost meaningless and little more than an attempt to sound intelligent.
I am encouraged that Fortune is, to some degree, thinking about the dependence business professionals have on the results from a Google query. I am troubled that the information presented is superficial.
There are some important questions to be answered; for example:
- What are the searching and online information behaviors of Fortune readers?
- What specific methods do Fortune readers use to obtain online information?
- What do Fortune readers do to verify the information obtained online?
- What additional research does a Fortune reader do when searching for information?
Answering these questions would provide more useful information. But in the pursuit of Web site traffic, many “real” journalists and publications embrace the listicle.
Is this the 8th deadly sin? Superficiality.
Stephen E Arnold, August 6, 2014
More European Accusations Against Google
August 6, 2014
Despite its previous European legal woes, Google retains its manipulative ways, at least according to Mathias Döpfner, CEO of Axel Springer, a prominent German media company. Neowin shares with us these latest allegations in, “CEO of European Publishing Giant Accuses Google of Downgrading Rivals’ Search Results.”
Writer Andy Weir reports on a recent BBC radio show, during which both Döpfner and Google communications VP Rachel Whetstone spoke about the issue. Döpfner insisted that Google regularly finesses its search results to its advantage, pointing to one particular algorithm change he says led to a 70% decrease in his company’s site traffic. Whetstone admits that “sometimes we do, sometimes we don’t” promote Google’s products over those of the competition in search results. (Yet she insists their “don’t be evil” motto, which began as “don’t let money affect your search rankings,” remains intact.)
The part that really caught my eye, though, had to do with the European Commission’s curious reaction. Weir writes:
[Döpfner] added that the European Commission’s proposal to deal with this – in response to numerous complaints from businesses large and small across Europe – will simply “make things worse” for companies. As part of that proposal, he said that Google would still be able to downgrade its rivals results, but would be forced to provide advertising space which companies could buy, in order to position their results more prominently against those of Google’s own products.
“This is a very strange proposal,” he continued. “I would call that ‘protection money’. I mean, it is basically the business principles of the Mafia – you say ‘either you pay, or we shoot you’. I think that is not the solution for the problem.”
I can see the reasoning there. Not surprisingly, though, Whetstone disagreed with the comparison.
Cynthia Murrell, August 06, 2014
Google Needs To Watch A Chinese Rival
August 6, 2014
Asia appears to be the place to go for alternative search engines that are large enough to rival Google. Russia has Yandex and now China has created Baidu. Baidu, however, is now crossing oceans and is deployed in Brazil, says ZDNet in “Chinese Search Engine Baidu Goes Live In Brazil.” Baidu emigrated to Brazil in 2012, launched free Web services in 2013, and this year made the search engine available.
Baidu is the second largest search engine with 16.49 percent market share. Google has a little over 70 percent.
Baidu moved to Brazil to snag 43 million users who are predicted to get on the Internet in the next three years. The users are fresh search meat, so they will need a cheap and user-friendly platform. If Baidu gets these people in their Internet infancy, the search engine will probably have them for life.
Baidu also has government support:
“The launch of Baidu in Brazil coincided with a series of agreements between the Brazilian and Chinese governments, also made public yesterday during an official ceremony with Brazilian president Dilma Rousseff and her Chinese counterpart Xi Jinping. These included the creation of a “digital city” in the remote state of Tocantins with funding provided by the Chinese Development Bank and improved partnerships with universities to support the international scholarships program of the Brazilian government.”
Foreign search engines are sneaking up on Google. The monopoly has not toppled yet, but competition is increasing. Google ramps up its battle with Samsung for a smartwatch skirmish. Microsoft could up the ante if they offered Microsoft Office Suite free to rival Google’s free software.
Whitney Grace, August 06, 2014
A Less Small Thing: Forking Android
August 5, 2014
A few years ago, I was in China. I marveled at the multi-SIM phones. I fiddled with a half dozen models and bought an unlocked GSM phone running Android 2.3. The clerk in the store told me that there would be Android phones without Google. At the time, I was thinking about the fragmentation of Android. In hindsight, I think the clerk in Xian knew a heck of a lot more about the future of Android without Google than I understood. The Chinese manufacturers liked Android but not the Google ball and chain “official Android” required of licensees. Android without Google seems to be a less small thing.
I read “Google Under Threat as Forked Android Devices Rise to 20% of Smartphone Shipments.” The article points out that Android has a market share of 85 percent. It also points out that market share is one thing. Revenue is another. With Web search from traditional computers losing its pride of place, mobile search is a bigger and bigger deal. Unfortunately the money generated by mobile clicks is not the gusher that 2004 style search was. To compensate, Google has been monetizing its silicon heart out. You can read one person’s view of Google search in “Dear Google, I Am Writing an Open Letter from the Search Wilderness.”
I am sure Google will dismiss the NextWeb’s story. I am not so sure. As NextWeb observes, “The company faces a growing issue: The rise of non Google Android.” The real test will be the steps Google takes to pump up the top line and control costs at a time when complaints about Google search are becoming more interesting and compelling.
Stephen E Arnold, August 5, 2014
The March of IBM Watson: From Kitchen to Executive Suite
August 5, 2014
Watson, fresh from its recipe innovations at Bon Appétit, is on the move…again. From the game show to the hospital, Watson has been demonstrating its expertise in the most interesting venues.
I read “A Room Where Executives Go to Get Help from IBM’s Watson.” The subtitle is an SEO dream: “Researchers at IBM are testing a version of Watson designed to listen and contribute to business meetings.” I know IBM has loads of search and content processing capability. In addition to the gems cranked out by Dr. Jon Kleinberg and Dr. Ramanathan Guha, IBM has oodles of acquisitions in the search and content processing sector. Do you know about Clementine? Are you familiar with iPhrase? Have you explored Cybertap’s indexing and search function with your local IBM representative? What about Vivisimo? What about the search functions in DB2, FileNet, and OmniFind regardless of its incarnation? Whew. That’s a lot of search and content processing horsepower. I think most of that power remains in the barn.
Watson is not in the barn. Watson is a raging bull. Watson is, I believe, something special. Based on open source technology plus home brew wizardry, Watson is a next-generation information retrieval world beater. The idea is that Watson is trained in a manner similar to the approach used by Autonomy in 1996. Then that indexed content is whipped into a question answering system. Hapless chefs, litigation wary physicians, and now risk averse MBAs can use Watson to make better decisions or answer really tough questions.
I know this to be true because Technology Review tells me so. Whatever MIT-tinged Technology Review says is pretty darned solid. Here’s a passage I noted:
Everything said in the room can be instantly transcribed, providing a detailed record of any meeting, and allowing the system to listen out for commands addressed to “Watson.” Those commands can be simple requests for information of the kind you might type into a search box. But Watson can also take a more active role in a discussion. In a live demonstration, it helped researchers role-playing as executives to generate a short list of companies to acquire.
The write up explains that a little bit of preparation is required. There’s the pesky training, which is particularly annoying when the topic of the meeting is, “The DOJ attorneys are here to discuss the depositions” or “We have a LOCA at the reactor. Everyone to my conference room now.” I suppose most business meetings are even more exciting.
Technology Review points out that the technology has a tough time converting executive speech to text. Watson uses the text as fodder for the indexing and parsing required to pass queries to the internal subsystems which then tap into Watson for answers. The natural language query and automatic query refinement functions seem to work well for game show questions and for discerning uses of tamarind. For a LOCA meeting or discussion of a deposition, Watson may need a bit more work.
I find the willingness of major “real” news outlets to describe Watson in juicy write ups an indication of the esteem in which IBM is held. My view is a bit different. I am not sure the Watson group at IBM knows how to generate substantial revenues. The folks have to make some progress toward $1 billion in revenue and then grow that revenue to a modest $10 billion in five or six years.
The fact that outfits in search and content processing have failed to hit more modest benchmarks for decades is irrelevant. The only search company that I know has generated billions is Google. Keep in mind that those billions come from online advertising. HP bought Autonomy for $11 billion in the hopes of owning a Klondike. IBM wisely went with open source technology and home grown code.
But the eventual effect of both HP’s and IBM’s approach will be more modest revenues. HP makes a name for itself via litigation and IBM is making a name for itself with demonstrations and some recipes.
Search and content processing, whether owned by a large company or a small one, faces some credibility, marketing, revenue, technology, and profit challenges. I am not sure a business triathlete can complete the course at this time. Talk is just so much easier than getting over or around the course intact.
Stephen E Arnold, August 5, 2014