August 12, 2014
The article titled SLI Systems Plunges to Lowest Since Listing on TVNZ discusses the recent burst of listings. SLI Systems is a company that provides site search, navigation and “user-generated SEO.” SLI’s share price shows the pressure findability vendors are facing in today’s marketplace. The stock fell over seven percent and remains just above its initial public offer price of $1.15. The article states,
“The local stock market is experiencing a flurry of listings which is spoiling investors for choice after it got a shot in the arm from the government’s partial privatisation last year, and the recent listings of software developers Gentrack Group and Serko have only added to tech investments available. Next week, IkeGPS Group, which sells a range of portable measuring devices, plans to list while Vista Entertainment, the cinema software and data analytics company, is due in August…”
Paul Harrison of Salt Funds Management believes that the flood of listings is not the only culprit for falling prices. Instead, he suggests that certain stocks were simply priced too highly and the current downward trend is a “hangover” following the initial “frenzy.” Other affected companies mentioned include Xero, the accounting software firm, the biotech company Pacific Edge which was unchanged, and Diligent, which also fell in price.
Chelsea Kerwin, August 12, 2014
August 12, 2014
An article on the Library Journal Infodocket is titled Co-Founder of Vivisimo Launches “OnlyBoth” and It’s Super Cool! The article continues in this entirely unbiased vein. OnlyBoth, it explains, was created by Raul Valdes-Perez and Andre Lessa. It offers an automated process of finding data and delivering it to the user in perfect English. The article states,
“What does OnlyBoth do? Actions speak louder than words so go take a look but in a nutshell, OnlyBoth can mine a dataset, discover insights, and then write what it finds in grammatically correct sentences. The entire process is automated. At launch, OnlyBoth offers an application providing insights on 3,122 U.S. colleges and universities described by 190 attributes. Entries also include a list of similar and neighboring institutions. More applications are forthcoming.”
The article suggests that this technology will easily lend itself to more applications, but for now it is limited to presenting the facts about colleges and baseball in perfect English. The idea is called “niche finding,” which Valdes-Perez developed in the early 2000s and never finished. The technology focuses on factual data that requires some reasoning. For example, the OnlyBoth website suggests that the insight “If California were a country, it would be the tenth biggest in the world” is a more complicated piece of information than just a simple fact like the population of California. OnlyBoth promises that more applications are forthcoming.
Chelsea Kerwin, August 12, 2014
August 11, 2014
I know that Googlers and Xooglers are absolutely the best. I read “Ex-Google Engineer to Lead Fix-It Team for Government Websites.” I am confident that the Xoogler will bring high magic to the problematic Web sites from numerous Federal entities and quasi-government entities. In the year 2000, there were 36,000 of these puppies. I don’t recall how many were not working the way the developers intended.
I don’t know how many US government Web sites there are today because the nifty free tools I used in 2000 and 2001 no longer work the way they did a decade ago.
How long will it take to address the backend issues of HealthCare.gov or get the other sites with glitches working “just like Google”? I think USA.gov might warrant a quick look too. I suppose one could check out the performance metrics for America Online or Yahoo, two outfits run by Xooglers. There may be some data that help in predicting the fix time.
Stephen E Arnold, August 11, 2014
August 8, 2014
The early days of Internet search always yielded a myriad of search results. No two searches were ever alike and sponsored ads never made it to the top, because they were not around much. It was especially fun, because you got to see more personal, less corporate content. Now search results are more accurate, but they are cluttered with paid links. Given that humans are also creatures of habit, we tend not to stray far from our safe surfing paths and shockingly the Internet can become a boring place.
Makeuseof.com wrote “Discover Interesting Content With Five Ways To Randomize The Internet” and it points out some neat ways to discover new information. It highlights basic ways: Random Wikipedia, random Google Street View, random YouTube, and random Reddit. Be prepared to get sucked into Internet linkage, videos, and photos for hours if you use any of these tools of randomness. Random Website takes users to any random Web site in its generator.
“How often do you find yourself on the Internet looking at the same boring pages? You know there is something out there but you don’t know where to look. Trust me, how bad could it be?”
What is fun is being taken to dark pages of Web 1.0 or a Web site that serves no purpose other than hosting a single word on a single page.
A lot of Internet content is weird, as seen by using these tools, but some of it can lead you to new thoughts and interests. If you need a metaphor, imagine the Internet is like an encyclopedia, except the entries never end and contain all the information about a topic instead of a short summary.
August 7, 2014
Sphere Engineering is looking to reinvent the way Web information is organized with QuickAnswers.io. This search engine returns succinct answers to questions instead of results lists. More a narrowed Wolfram|Alpha than a Google. At least that’s the idea. So far, though, it’s a great place to ask a question—as long as it’s a question to which the system knows the answer. I tried a few queries and got back almost as many “sorry, I don’t know”s or nonsense responses. For now, at least, the page admits that “the current state of this project only reflects a tiny fraction of what is possible.” Still, it may be worth checking back in as the system progresses.
The company’s blog post about the project lets us in on the vision of what QuickAnswers could become. Software engineer François Chollet writes:
“I recently completed a total rewrite of QuickAnswers.io, based on a new algorithm. I call it ‘shallow QA’, as opposed to IBM Watson’s ‘deep QA’. IBM Watson keeps a large knowledge model available for queries and thus requires a supercomputer to run. At the other end of the spectrum, QuickAnswers.io generates partial knowledge models on the fly and can run on a micro-instance.
“QuickAnswers.io is a semantic question answering engine, capable of providing quick answer snippets to any question that can be answered with knowledge found on the web. It’s like a specialized, quicker version of a search engine. You can see a quick overview of the previous version here.”
The description then gets technical. Chollet uses several examples to illustrate the algorithm’s approach, the results, and some of the challenges he’s faced. He also explains his ambitious long-range vision:
“In the longer term, I’d like to read the entirety of the web and build a complete semantic Bayesian map matching a maximum of knowledge items. Also, it would be nice to have access to a visualization tool for the different answers available and their frequency across sectors of opinion, thus solving the problem of subjectivity.”
These are some good ideas, but of course implementation is the tough part. We should keep an eye on these folks to see whether those ideas make it to fruition. While pursuing such visionary projects, Sphere Engineering earns its dough by building custom machine-learning and data-mining solutions.
Cynthia Murrell, August 07, 2014
August 7, 2014
Anyone on the lookout for a free intranet search system? FreewareFiles offers Arch Search Engine 1.7, also known as CSIRO Arch. The software will eat up 22.28MB, and works on both 32-bit and 64-bit systems running Windows 2000 through Windows 7 or MacOS or MacOS X. Here’s part of the product description:
Arch is an open source extension of Apache Nutch (a popular, highly scalable general purpose search engine) for intranet search. Not happy with your corporate search engine? No surprise, very few people are. Arch (finally!) solves this problem. Don’t believe it? Try Arch, blind test evaluation tools are included.
In addition to excellent search quality, Arch has many features critical for corporate environments, such as document level security.
*Excellent search quality: Arch has solved the problem of providing good search results for corporate web sites and intranets!
*Up to date information: Arch is very efficient at updating indexes and this ensures that the search results are up to date and relevant. Unlike most search engines, no complete ‘recrawls’ are done. The indexes can be updated daily, with new pages discovered automatically.
*Multiple web sites: Arch supports easy dynamic inclusion or removal of websites.
They also say the system is easy to install and maintain; uses two indexes so there’s always a working one; and is customizable with either Java or PHP.
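The “two indexes so there’s always a working one” idea can be pictured with a toy sketch. This is not Arch’s code; the class name `TwoIndexSearch` and its methods are hypothetical, and the assumption is simply that a fresh index is built off to the side and swapped in only when it is complete.

```python
class TwoIndexSearch:
    """Toy inverted index that rebuilds a standby copy and then swaps it in."""

    def __init__(self):
        self.live = {}     # term -> set of document ids, used to answer queries
        self.standby = {}  # rebuilt off to the side, never queried directly

    def rebuild(self, documents):
        """Build a fresh index from {doc_id: text} without touching the live one."""
        fresh = {}
        for doc_id, text in documents.items():
            for term in text.lower().split():
                fresh.setdefault(term, set()).add(doc_id)
        self.standby = fresh

    def swap(self):
        """Promote the standby index; the old live index becomes the standby."""
        self.live, self.standby = self.standby, self.live

    def search(self, term):
        """Answer a single-term query from the live index only."""
        return sorted(self.live.get(term.lower(), set()))


engine = TwoIndexSearch()
engine.rebuild({"a": "intranet search", "b": "corporate intranet"})
before_swap = engine.search("intranet")  # live index is still empty here
engine.swap()
after_swap = engine.search("intranet")
```

Queries keep hitting the old index until the swap, so a slow or failed rebuild never leaves users without results.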
Cynthia Murrell, August 07, 2014
August 6, 2014
Asia appears to be the place to go for alternative search engines that are large enough to rival Google. Russia has Yandex and China has Baidu. Baidu, however, is now crossing oceans and is deployed in Brazil, says ZDNet in “Chinese Search Engine Baidu Goes Live In Brazil.” Baidu emigrated to Brazil in 2012, launched free Web services in 2013, and this year made its search engine available.
Baidu is the second largest search engine with 16.49 percent market share. Google has a little over 70 percent.
Baidu moved to Brazil to snag 43 million users who are predicted to get on the Internet in the next three years. The users are fresh search meat, so they will need a cheap and user-friendly platform. If Baidu gets these people in their Internet infancy, the search engine will probably have them for life.
Baidu also has government support:
“The launch of Baidu in Brazil coincided with a series of agreements between the Brazilian and Chinese governments, also made public yesterday during an official ceremony with Brazilian president Dilma Rousseff and her Chinese counterpart Xi Jinping. These included the creation of a “digital city” in the remote state of Tocantins with funding provided by the Chinese Development Bank and improved partnerships with universities to support the international scholarships program of the Brazilian government.”
Foreign search engines are sneaking up on Google. The monopoly has not toppled yet, but competition is increasing. Google ramps up its battle with Samsung for a smartwatch skirmish. Microsoft could up the ante if they offered Microsoft Office Suite free to rival Google’s free software.
August 1, 2014
A new search engine has appeared on the radar called Niice. It is focused on inspiration search and the presentation of quality images to spark ideas. The Niice blog lays out their mission statement (to be a sort of upscale Google for graphic designers, photographers, tasteful people in general) as well as stating their design principles. These include remaining safe for work, not getting in the way, and being restrained in content choices. They also stress embracing serendipity:
“We want to give you results that you don’t expect, presented in a way that inspires your brain to make new connections. Niice isn’t for finding an image that you can copy, it’s for bringing together lots of ideas for you to combine into something new…. The internet is full of inspiration, but since Google doesn’t have a ‘Good Taste’ filter, finding it means jumping back and forth between blogs and gallery sites.”
Somewhat like Pinterest, Niice allows users to create “moodboards” or collections of images which can be saved, collaborated on, or downloaded as JPEGs. When I searched the term “water”, a collage of images appeared that included photographs of ocean waves and a sunset over a lake, a glass sculpture resembling a dew drop, and a picture that linked to a story on an artist who manipulates water with her mind among many others.
Chelsea Kerwin, August 01, 2014
July 31, 2014
I am not an attorney. I consider this a positive. I am not a PhD with credentials as impressive as those of Vladimir Igorevich Arnold, my distant relative. He worked with Andrey Kolmogorov, who was able to hike in some bare essentials AND do math at the same time. Kolmogorov and Arnold were both interesting, if idiosyncratic, guys. Hiking in the wilderness with some students, anyone?
Now to the matter at hand. Last night I sat down with a copy of US 8,666,730 B2 (hereinafter I will use this shortcut for the patent, 730), filed in an early form in 2009, long before Information Handling Services wrote a check to the owners of The Invention Machine.
The title of the system and method is “Question Answering System and Method Based on Semantic Labeling of Text Documents and User Questions.” You can get your very own copy at www.uspto.gov. (Be sure to check out the search tips; otherwise, you might get a migraine dealing with the search system. I heard that technology was provided by a Canadian vendor, which seems oddly appropriate if true. The US government moves in elegant, sophisticated ways.)
Well, 730 contains some interesting information. If you want to ferret out more details, I suggest you track down a friendly patent attorney and work through the 23 page document word by word.
My analysis is that of a curious old person residing in rural Kentucky. My advisors are the old fellows who hang out at the local bistro, Chez Mine Drainage. You will want to keep this in mind as I comment on this invention by James Todhunter (Framingham, Mass), Igor Sovpel (Minsk, Belarus), and Dzianis Pastanohau (Minsk, Belarus). Mr. Todhunter is described as “a seasoned innovator and inventor.” He was the Executive Vice President and Chief Technology Officer for Invention Machine. See http://bit.ly/1o8fmiJ, LinkedIn at (if you are lucky) http://linkd.in/1ACEhR0, and this YouTube video at http://bit.ly/1k94RMy. Igor Sovpel, co-inventor of 730, has racked up some interesting inventions. See http://bit.ly/1qrTvkL. Mr. Pastanohau was on the 730 team and he also helped invent US 8,583,422 B2, “System and Method for Automatic Semantic Labeling of Natural Language Texts.”
The question answering invention is explained this way:
A question-answering system for searching exact answers in text documents provided in the electronic or digital form to questions formulated by user in the natural language is based on automatic semantic labeling of text documents and user questions. The system performs semantic labeling with the help of markers in terms of basic knowledge types, their components and attributes, in terms of question types from the predefined classifier for target words, and in terms of components of possible answers. A matching procedure makes use of mentioned types of semantic labels to determine exact answers to questions and present them to the user in the form of fragments of sentences or a newly synthesized phrase in the natural language. Users can independently add new types of questions to the system classifier and develop required linguistic patterns for the system linguistic knowledge base.
The idea, as I understand it, is that I can craft a question without worrying about special operators like AND or field labels like CC=. Presumably I can submit this type of question to a search system based on 730 and its related inventions like the automatic indexing in 422.
The references cited for this 2009 or earlier invention are impressive. I recognized Mr. Todhunter’s name, that of a person from Carnegie Mellon, and one of the wizards behind the tagging system in use at SAS, the statistics outfit loved by graduate students everywhere. There were also a number of references to Dr. Liz Liddy, Syracuse University. I associated her with the mid to late 1990s system marketed then as DR LINK (Document Retrieval Linguistic Knowledge). I have never been comfortable with the notion of “knowledge” because it seems to require that subject matter experts and other specialists update, edit, and perform various processes to keep the “knowledge” from degrading into a ball of statistical fuzz. When someone complains that a search system using Bayesian methods returns off point results, I look for the humans who are supposed to perform “training,” updates, remapping, and other synonyms for “fixing up the dictionaries.” You may have other experiences which I assume are positive and have garnered you rapid promotion for your search system competence. For me, maintaining knowledge bases usually leads to lots of hard work, unanticipated expenses, and the customary termination of a scapegoat responsible for the search system.
I am never sure how to interpret extensive listings of prior art. Since I am not qualified to figure out if a citation is germane, I will leave it to you to wade through the full page of US patent, foreign patent documents, and other publications. Who wants to question the work of the primary examiner and the Faegre Baker Daniels “attorney, agent, or firm” tackling 730?
On to the claims. The patent lists 28 claims. Many of them refer to operations within the world of what the inventors call expanded Subject-Action-Object or eSAO. The idea is that the system figures out parts of speech, looks up stuff in various knowledge bases and automatically generated indexes, and presents the answer to the user’s question. The lingo of the patent is sufficiently broad to allow the system to accommodate an automated query in a way that reminded me of Ramanathan Guha’s massive semantic system. I cover some of Dr. Guha’s work in my now out of print monograph, Google Version 2.0, published by one of the specialist publishers that perform Schubmehl-like maneuvers.
My first pass through 730’s claims left me with a sense of déjà vu, which is obviously not correct. The invention has been awarded the status of a “patent”; therefore, the invention is novel. Nevertheless, these concepts pecked away at me with the repetitiveness of the woodpecker outside my window this morning:
- Automatic semantic labeling which I interpreted as automatic indexing
- Natural language processing, which I understand suggests the user takes the time to write a question that is neither too broad nor too narrow. Like the children’s story, the query is “just right.”
- Assembly of bits and chunks of indexed documents into an answer. For me the idea is that the system does not generate a list of hits that are probably germane to the query. The Holy Grail of search is delivering to the often lazy, busy, or clueless user an answer. Google does this for mobile users by looking at a particular user’s behavior and the clusters to which the user belongs in the eyes of Google math, and just displaying the location of the pizza joint or the fact that a parking garage at the airport has an empty space.
- The system figures out parts of speech, various relationships, and who-does-what-to-whom. Parts of speech tagging has been around for a while and it works as long as the text processed is not in the argot of a specialist group plotting some activity in a favela in Rio.
- The system performs the “e” function. I interpreted the “e” to mean a variant of synonym expansion. DR LINK, for example, was able in 1998 to process the phrase white house and display content relevant to presidential activities. I don’t recall how this expansion from bound phrase to presidential to Clinton worked. I do recall that DR LINK had what might be characterized as a healthy appetite for computing resources to perform its expansions during indexing and during query processing. This stuff is symmetrical. What happens to source content has to happen during query processing in some way.
- Relevance ranking takes place. Various methods are in use by search and content processing vendors. Some are based on statistical methods. Others are based on numerical recipes that the developer knows can be computed within the limits of the computer systems available today. No P=NP, please. This is search.
- There are linguistic patterns. When I read about linguistic patterns I recall the wild and crazy linguistic methods of Delphes, for example. Linguistics is in demand today, as are specialist vendors like Bitext in Madrid, Spain. English, Chinese, and Russian are widely used languages. But darned useful information is available in other languages. Many of these are kept fresh via neologisms and slang. I often asked my intelligence community audiences, “What does teddy bear mean?” The answer is NOT a child’s toy. The clue is the price tag suggested on sites like eBay auctions.
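The symmetry point in the list above, that whatever happens to source content also has to happen to the query, can be shown with a small sketch. It is an illustration only, not the method claimed in 730 or used in DR LINK. The synonym table, the function names, and the sample documents are all made up for the example.

```python
# Hypothetical synonym table; a real system would draw on curated knowledge bases.
SYNONYMS = {
    "white house": {"president", "presidential"},
}


def expand(text):
    """Return the terms for a text: its words plus synonyms of any known phrase."""
    lowered = text.lower()
    terms = set(lowered.split())
    for phrase, extra in SYNONYMS.items():
        if phrase in lowered:  # naive substring match, good enough for a toy
            terms |= extra
    return terms


def build_index(documents):
    """Index every document under its expanded terms."""
    index = {}
    for doc_id, text in documents.items():
        for term in expand(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Expand the query the same way, then return ids matching any expanded term."""
    hits = set()
    for term in expand(query):
        hits |= index.get(term, set())
    return sorted(hits)


docs = {
    "d1": "White House briefing on trade",
    "d2": "The president signed the bill",
}
index = build_index(docs)
phrase_hits = search(index, "white house")
synonym_hits = search(index, "presidential")
```

The phrase query reaches the document that only says “president” because the query was expanded, and the “presidential” query reaches the briefing because the document was expanded. Both passes cost computing time, which is the appetite for resources noted above.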
The interesting angle in 730 is the causal relationship. When applied to processes in the knowledge bases, I can see how a group of patents can be searched for a process. The result list could display ways to accomplish a task. NOTting out patents for which a royalty is required leaves the searcher with systems and methods that can be used, ideally without any hassles from attorneys or licensing agents.
Several questions popped into my mind as I reviewed the claims. Let me highlight three of these:
First, there is the computational load when large numbers of new documents and changed content have to be processed. The indexes have to be updated. For small domains of content like 50,000 technical reports created by an engineering company, I think the system will zip along like a 2014 Volkswagen Golf.
Source: US8666730, Figure 1
When terabytes of content arrive every minute, then the functions set forth in the block diagram for 730 have to be appropriately resourced. (For me, “appropriately resourced” means lots of bandwidth, storage, and computational horsepower.)
Second, the knowledge base, as I thought about when I first read the patent, has to be kept in tip top shape. For scientific, technical, and medical content, this is a more manageable task. However, when processing intercepts in slang filled Pashto, there is a bit more work required. In general, high volumes of non technical lingo become a bottleneck. The bottleneck can be resolved, but none of the solutions are likely to make a budget conscious senior manager enjoy his lunch. In fact, the problem of processing large flows of textual content is acute. Short cuts are put in place and few of those in the know understand the impact of trimming on the results of a query. Don’t ask. Don’t tell. Good advice when digging into certain types of content processing systems.
Third, the reference to databases begs this question, “What is the amount of storage required to reduce index latency to less than 10 seconds for new and changed content?” Another question, “What is the gap that exists for a user asking a mission critical question between new and changed content and the indexes against which the mission critical query is passed?” This is not system response time, which as I recall for DR LINK era systems was measured in minutes. The user sends a query to the system. The new or changed information is not yet in the index. The user makes a decision (big or small, significant or insignificant) based on incomplete, incorrect, or stale information. No big problem if one is researching a competitor’s new product. Big problem when trying to figure out what missile capability exists now in a region of conflict.
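The gap between arriving content and the index can be made concrete with a toy model. The assumptions are mine, not the patent’s: the index refreshes on a fixed schedule starting at time zero, and each refresh instantly picks up everything that has arrived so far. The function name and the sample arrival times are hypothetical.

```python
def visible_documents(arrivals, refresh_interval, query_time):
    """Return ids of documents the index can see at query_time.

    Assumes refreshes happen at t = 0, refresh_interval, 2 * refresh_interval, ...
    and that each refresh picks up every document that has arrived by then.
    """
    last_refresh = (query_time // refresh_interval) * refresh_interval
    return sorted(doc for doc, t in arrivals.items() if t <= last_refresh)


# Arrival times in seconds (made-up values).
arrivals = {"report-1": 5, "report-2": 58, "report-3": 61}

seen = visible_documents(arrivals, refresh_interval=60, query_time=119)
missed = sorted(set(arrivals) - set(seen))
```

With a 60 second refresh, a query at second 119 runs against the index as of second 60, so the report that arrived at second 61 is invisible for almost a full minute. Shrinking that window is where the storage and computing bill comes from.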
My interest is enterprise search. IHS, a professional publishing company that is in the business of licensing access to its for fee data, seems to be moving into the enterprise search market. (See http://bit.ly/1o4FyL3.) My researchers (an unreliable bunch of goslings) and I will be monitoring the success of IHS. Questions of interest to me include:
- What is the fully loaded first year cost of the IHS enterprise search solution? For on premises installations? For cloud based deployment? For content acquisition? For optimization? For training?
- How will the IHS system handle flows of real time content into its content processing system? What is the load time for 100 terabytes of text content with an average document size of 50 KB? What happens to attachments, images, engineering drawings, and videos embedded in the stream as native files or as links to external servers?
- What is the response time for a user’s query? How does the user modify a query in a manner so that result sets are brought more in line with what the user thought he was requesting?
- How do answers make use of visual outputs which are becoming increasingly popular in search systems from Palantir, Recorded Future, and similar providers?
- How easy is it to scale content processing and index refreshing to keep pace with the doubling of content every six to eight weeks that is becoming increasingly commonplace for industrial strength enterprise search systems? How much reengineering is required for log scale jumps in content flows and user queries?
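Some back-of-the-envelope arithmetic puts numbers on two of the questions above. The sketch reads the sizes as decimal terabytes and kilobytes. The ingest rate of 10,000 documents per second is a made-up assumption for illustration, not a figure from IHS or the patent.

```python
TB = 10**12  # decimal terabyte, in bytes
KB = 10**3   # decimal kilobyte, in bytes
SECONDS_PER_DAY = 86_400


def document_count(total_bytes, avg_doc_bytes):
    """Number of documents in a corpus of a given size and average document size."""
    return total_bytes // avg_doc_bytes


def load_time_days(n_docs, docs_per_second):
    """Days needed to ingest n_docs at a steady rate."""
    return n_docs / docs_per_second / SECONDS_PER_DAY


def growth_factor(weeks_elapsed, doubling_weeks):
    """How many times larger the content is after weeks_elapsed of steady doubling."""
    return 2 ** (weeks_elapsed / doubling_weeks)


n_docs = document_count(100 * TB, 50 * KB)
days = load_time_days(n_docs, docs_per_second=10_000)  # assumed ingest rate
yearly_growth_slow = growth_factor(52, 8)  # doubling every eight weeks
yearly_growth_fast = growth_factor(52, 6)  # doubling every six weeks

print(f"{n_docs:,} documents, about {days:.1f} days to load")
print(f"one year of growth: {yearly_growth_slow:.0f}x to {yearly_growth_fast:.0f}x")
```

Under these assumptions the corpus is two billion documents, and a year of doubling every six to eight weeks multiplies the content by somewhere between roughly 90 and 400.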
Take a look at 730 and others in the Invention Machine (IHS) patent family. My hunch is that if IHS is looking for a big bucks return from enterprise search sales, IHS may find that its narrow margins will be subjected to increased stress. Enterprise search has never been nor is now a license to print money. When a search system does pump out hundreds of millions in revenue, it seems that some folks are skeptical. Autonomy and Fast Search & Transfer are companies with some useful lessons for those who want a digital Klondike.
July 29, 2014
I received an email about the new “www.SearchCIO.com.” Here it is:
I was not aware of the old search CIO. I clicked a link that delivered me to a page asking me to log in. I ignored that and navigated to the search box and entered the query “failure.” The system responded with 13,060 articles with the word failure in them, 103 conversations, and 219 definitions of failure.
The first hit was to an IBM mainframey problem with a direct access storage device. Remember those from 2003 and before? The second hit was a 2002 definition about “failure protection.”
The new search system appears to pull matching strings from articles and content objects across TechTarget’s different publications/information services. I clicked on the DASD failure link and was enjoined to sign up for a free membership. Hmmm. Okay. Plan B.
In my lectures about tactics for getting useful open source information, I focus on services like Ixquick.com. No registration and no invasive tracking. The Ixquick.com approach is different from the marketing-oriented, we-want-an-email-address approach in use at www.searchcio.com. Here’s what Ixquick displayed, quite quickly as well:
The first hit was to a detailed chunk of information from IBM called “DASD Ownership Notification (DVHXDN).” No registration required. The hits were on point and quite useful in my opinion. A happy quack for Ixquick.
If you have an appetite for TechTarget information, navigate to http://searchcio.techtarget.com/. If you want helpful search results from a pretty good metasearch engine, go for Ixquick.
Stephen E Arnold, July 30, 2014