Microsoft Azure Plans Offer a Goldilocks and the Three Bears Strategy to Find the Perfect Fit

April 15, 2016

The article on eWeek titled Microsoft Debuts Azure Basic Search Tier relates the perks of the new plan from Microsoft, namely, that it is cheaper than the others. At $75 per month (and currently half off for the preview period, so get it while it’s hot!) the Basic Azure plan has lower indexing capacity, but that is the intention. The completely Free plan enables indexing of 10,000 documents and allows for 50 megabytes of storage, while the new Basic plan goes up to a million documents. The more expensive Standard plan costs $250 per month and provides for up to 180 million documents and 300 gigabytes of storage. The article explains,

“The new Basic tier is Microsoft’s response to customer demand for a more modest alternative to the Standard plans, said Liam Cavanagh, principal program manager of Microsoft Azure Search, in a March 2 announcement. “Basic is great for cases where you need the production-class characteristics of Standard but have lower capacity requirements,” he stated. Those production-class capabilities include dedicated partitions and service workloads (replicas), along with resource isolation and service-level agreement (SLA) guarantees, which are not offered in the Free tier.”

So just how efficient is Azure? Cavanagh stated that his team measured indexing performance at 15,000 documents per minute (although he also stressed that this was with batches organized into groups of 1,000 documents). With this new plan, Microsoft continues to build out its cloud search capabilities.
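As a rough illustration of the batching Cavanagh mentions, here is a minimal Python sketch that pushes documents to an Azure Search index in groups of 1,000 through the service’s REST indexing endpoint. The service name, index name, admin key, API version string, and document fields are placeholders to adapt, not Microsoft’s sample code.

import json
import requests

# Placeholder values; substitute your own service, index, and admin key.
SERVICE = "my-search-service"
INDEX = "my-index"
API_KEY = "YOUR-ADMIN-API-KEY"
API_VERSION = "2015-02-28"  # assumed version string; check your subscription

ENDPOINT = (
    f"https://{SERVICE}.search.windows.net/"
    f"indexes/{INDEX}/docs/index?api-version={API_VERSION}"
)
HEADERS = {"Content-Type": "application/json", "api-key": API_KEY}


def index_in_batches(documents, batch_size=1000):
    """Upload documents in batches of 1,000, mirroring the batch size
    Cavanagh cites for the 15,000 documents per minute figure."""
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        payload = {"value": [{**doc, "@search.action": "upload"} for doc in batch]}
        response = requests.post(ENDPOINT, headers=HEADERS, data=json.dumps(payload))
        response.raise_for_status()


if __name__ == "__main__":
    docs = [{"id": str(i), "content": f"document {i}"} for i in range(5000)]
    index_in_batches(docs)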

Chelsea Kerwin, April 15, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Google Hummingbird Revealed by a Person Not Working for Google

April 7, 2016

Another wizard has scrutinized the Google and figured out how to make sure your site becomes number one with a bullet.

To get the wisdom, navigate to “Hummingbird – Mastering the art of Conversational Search.” The problem for the GOOG is that it costs a lot of money to index Web sites no one visits. Advertisers want traffic. That means the GOOG has to find a way to reduce costs and sell either more ads or fewer ads at a higher price.

The write up pays scant attention to the realities of the Google. But you will learn the tips necessary to work traffic magic. Okay, I don’t get too excited about info about Google from folks who do not work at the company or who used to work there. Sorry. Looking at the Google and reading tea leaves does not work for me.

But what works, according to the write up, are these sure fire tips. Here we go:

  1. Bone up on latent semantic indexing. Let’s see. That method has been around for 30, maybe 40 years. Get a move on, gentle reader. (A bare bones sketch of the method appears after this list.)
  2. Make your Web site mobile friendly. Unfortunately mobile Web sites don’t get more traffic than a regular Web site which does not get much traffic. Sorry. The majority of clicks flow to a small percentage of the accessible Web sites.
  3. Forget the keyword thing. Well, I usually use words to write my articles and Web sites. I worry about focusing on a small number of topics and using the words necessary to get my point across. Keywords, in my opinion, are derivatives of information. Forgetting keywords is easy. I never used them before.
  4. Make your write ups accurate. Okay, that’s a start. What does one do with “real” news from certain sources? The info is baloney, but everyone pretends it is accurate. What’s up with that? The accuracy angle is part of Google’s scoring methods. Each of us has to deal with what’s correct in his or her own way. Footnotes and links are helpful. What happens when someone disagrees? Is this “accurate”? Oh, well.
  5. “Be bold and broad.” In my experience, not much content is bold and broad.
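For readers who want to know what tip number one actually involves, latent semantic indexing boils down to a singular value decomposition of the term-document matrix so that documents can be compared in a reduced “concept” space. A minimal sketch with scikit-learn follows; the toy documents and the choice of two components are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus; real LSI runs over thousands of documents.
docs = [
    "search engine optimization for mobile sites",
    "conversational search and query understanding",
    "mobile friendly web design tips",
    "keyword research for web traffic",
]

# Build the term-document matrix, then reduce it with SVD.
vectorizer = TfidfVectorizer(stop_words="english")
term_doc = vectorizer.fit_transform(docs)

# Two latent "concepts" are enough for a toy example.
lsi = TruncatedSVD(n_components=2, random_state=0)
concept_space = lsi.fit_transform(term_doc)

# Each row gives a document's coordinates in concept space.
print(concept_space)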

Now you understand Google Hummingbird. Will your mobile Web site generate hundreds of thousands of uniques if you adhere to this road map? Nah. Why not follow Google’s guidelines from the Google itself?

Stephen E Arnold, April 7, 2016

Search as a Framework

March 26, 2016

A number of search and content processing vendors suggest their information access system can function as a framework. The idea is that search is more than a utility function.

If the information in the article “Abusing Elasticsearch as a Framework” is spot on, a non-search vendor may have taken an important step toward making an assertion into a reality.

The article states:

Crate is a distributed SQL database that leverages Elasticsearch and Lucene. In its infant days it parsed SQL statements and translated them into Elasticsearch queries. It was basically a layer on top of Elasticsearch.

The idea is that the framework uses discovery, master election, replication, etc. along with the Lucene search and indexing operations.

Crate, the framework, is a distributed SQL database “that leverages Elasticsearch and Lucene.”
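As a rough illustration of what “parsed SQL statements and translated them into Elasticsearch queries” means, here is a toy Python sketch that maps one simple WHERE clause onto the Elasticsearch query DSL. Crate’s actual planner is far more sophisticated; the regex parsing here is deliberately naive.

import re

def sql_where_to_es(sql):
    """Translate 'SELECT ... WHERE field OP value' into an Elasticsearch
    query body. A toy translator, not Crate's planner."""
    match = re.search(r"WHERE\s+(\w+)\s*(>|<|=)\s*(\S+)", sql, re.IGNORECASE)
    if not match:
        return {"query": {"match_all": {}}}
    field, op, value = match.groups()
    if op == "=":
        return {"query": {"term": {field: value}}}
    range_op = "gt" if op == ">" else "lt"
    return {"query": {"range": {field: {range_op: value}}}}


print(sql_where_to_es("SELECT * FROM logs WHERE status = 404"))
# {'query': {'term': {'status': '404'}}}
print(sql_where_to_es("SELECT * FROM logs WHERE bytes > 1024"))
# {'query': {'range': {'bytes': {'gt': '1024'}}}}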

Stephen E Arnold, March 26, 2016

Ixquick and StartPage Become One

March 25, 2016

Ixquick was created by a person in Manhattan. Then the system shifted from the USA to Europe. I lost track. I read “Ixquick Merges with StartPage Search Engine.” Web search is a hideously expensive activity to fund. Costs can be suppressed if one just passes the user’s query to Bing, Google, or some other Web indexing search system. The approach delivers what is called a value-added opportunity. Vivisimo used the approach before it morphed into a unit of IBM and emerged not as a search federation system but a Big Data system. Most search traffic flows to the Alphabet Google advertising system. Those who use federated search systems often don’t know the difference and, based on my observations, don’t care.

According to the write up:

The main difference between StartPage and the current version of Ixquick is that the former is powered exclusively by Google search results while the latter aggregates data from multiple search engines to rank them based on factors such as prominence and quantity. Both search engines are privacy orientated, and the merging won’t change the fact. IP addresses are not recorded for instance, and data is not shared with third-parties.
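The “prominence and quantity” ranking described above can be sketched simply: give each result credit for how many engines return it and how highly each engine places it. The snippet below is a toy illustration of that idea, not Ixquick’s actual algorithm.

from collections import defaultdict

def aggregate(results_by_engine):
    """Score each URL by how many engines return it (quantity) and how
    high it appears in each engine's list (prominence). Toy scoring only."""
    scores = defaultdict(float)
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / rank  # higher placement earns more credit
    return sorted(scores, key=scores.get, reverse=True)


results = {
    "engine_a": ["example.org/one", "example.org/two", "example.org/three"],
    "engine_b": ["example.org/two", "example.org/four"],
}
print(aggregate(results))
# 'example.org/two' wins: two engines return it, both near the top.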

Like DuckDuckGo.com, Ixquick.com and StartPage.com “protect the user’s privacy.” My thought is that I am not confident Tor sessions are able to protect a user’s privacy. A general interest search engine which delivers on this assertion is interesting indeed.

If you want to use the Ixquick function that presents only Google results, navigate to www.ixquick.eu. There are other privacy oriented systems; for example, Gibiru and Unbubble.

Sorry, I won’t/can’t go into the privacy angle. You may want to poke around how secure a VPN session, Tails, and Tor are. The exploration may yield some useful information. Make sure your computing device does not have malware installed, please. Otherwise, the “privacy” issue is off the table.

Stephen E Arnold, March 25, 2016

DocPoint and Concept Searching: The ONLY Choice. Huh?

March 24, 2016

DocPoint is a consulting and services firm focusing on the US government’s needs. The company won’t ignore commercial firms’ inquiries, but the lineup of services seems to be shaped for the world of GSA Advantage users.

I noted that DocPoint has signed on to resell the Concept Searching indexing system. In theory, the SharePoint search service performs a range of indexing functions. In actual practice, like my grandmother’s cookies, many of the products are not cooked long enough. I tossed those horrible cookies in the trash. The licensees of SharePoint don’t have the choice I did when eight years old.

DocPoint is a specialist firm which provides what Microsoft cannot or no longer chooses to offer its licensees. Microsoft is busy trying to dominate the mobile phone market and doing bug fixes on the Surface product line.

The scoop about the DocPoint and Concept Searching deal appears in “DocPoint Solutions Adds Concept Searching To GSA Schedule 70.” The Schedule 70 reference means, according to WhatIs.com:

a long-term contract issued by the U.S. General Services Administration (GSA) to a commercial technology vendor. Award of a Schedule contract signifies that the GSA has determined that the vendor’s pricing is fair and reasonable and the vendor is in compliance with all applicable laws and regulations. Purchasing from pre-approved vendors allows agencies to cut through red tape and receive goods and services faster. A vendor doesn’t need to win a GSA Schedule contract in order to do business with U.S. government agencies, but having a Schedule contract can cut down on administrative costs, both for the vendor and for the agency. Federal agencies typically submit requests to three vendors on a Schedule and choose the vendor that offers the best value.

To me, the deal is a way for Concept Searching to generate revenue via a third party services firm.

In the write up about the tie up, I highlighted this paragraph, which contains an amazing assertion:

A DocPoint partner since 2012, Concept Searching is the only [emphasis added] company whose solutions deliver automatic semantic metadata generation, auto-classification, and powerful taxonomy tools running natively in all versions of SharePoint and SharePoint Online. By blending these technologies with DocPoint’s end-to-end enterprise content management (ECM) offerings, government organizations can maximize their SharePoint investment and obtain a fully integrated solution for sharing, securing and searching for mission-critical information.

Note the statement “only company whose solutions deliver…” “Only” means, according to the Google define function:

No one or nothing more besides; solely or exclusively.

Unfortunately the DocPoint assertion about Concept Searching as the only firm appears to be wide of the mark. Concept Searching is one of many companies offering the functions set forth in the content marketing “news” story. In my files, I have the names of dozens of commercial firms offering semantic metadata generation, auto-classification, and taxonomy tools. I wonder if Layer2 or Smartlogic have an opinion about “only”?
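For context on how unremarkable the claimed functions are, here is a bare bones sketch of taxonomy-driven auto-classification: match a document against term lists attached to taxonomy nodes and tag it with the best scoring nodes. Commercial products layer on linguistics, weighting, and workflow, but the core idea is this simple. The taxonomy and document below are invented for illustration.

import re

# Toy taxonomy: node name -> indicative terms.
TAXONOMY = {
    "Procurement": ["contract", "vendor", "schedule", "gsa"],
    "Records Management": ["retention", "archive", "disposition"],
    "Security": ["clearance", "classified", "breach"],
}

def classify(text, taxonomy=TAXONOMY):
    """Return taxonomy nodes ranked by how many of their terms appear."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        node: sum(words.count(term) for term in terms)
        for node, terms in taxonomy.items()
    }
    return sorted(
        ((node, score) for node, score in scores.items() if score > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )


doc = "The vendor signed a GSA schedule contract covering records retention."
print(classify(doc))
# [('Procurement', 4), ('Records Management', 1)]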

Stephen E Arnold, March 24, 2016

DeepGram: Audio Search in Lectures and Podcasts

March 23, 2016

I read “DeepGram Lets You Search through Lectures and Podcasts for Your Favorite Quotes.” I don’t think the system is available at this time. The article states:

Search engines make it easy to look through text files for specific words, but finding phrases and keywords in audio and video recordings could be a hassle. Fortunately, California-based startup DeepGram is working on a tool that will make this process simpler.

The hint is the “is working.” Not surprisingly, the system is infused with artificial intelligence. The process is to convert speech to text and then index the result.
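The pipeline the article implies can be sketched in two steps: transcribe the audio, then build an inverted index from spoken words to timestamps so a query returns playback positions. In the Python sketch below, the transcribe function, the file name, and the timestamped output format are stand-ins for whatever speech-to-text service one actually uses; nothing here reflects DeepGram’s implementation.

from collections import defaultdict

def transcribe(audio_path):
    """Stand-in for a real speech-to-text call. Assumed to return a list
    of (word, seconds) pairs; replace with an actual transcription service."""
    return [("deep", 12.0), ("learning", 12.4), ("for", 12.7), ("audio", 12.9)]


def build_index(audio_paths):
    """Map each spoken word to the (file, timestamp) positions where it occurs."""
    index = defaultdict(list)
    for path in audio_paths:
        for word, seconds in transcribe(path):
            index[word.lower()].append((path, seconds))
    return index


def search(index, query):
    """Return playback positions for every query word found in the index."""
    return {word: index.get(word.lower(), []) for word in query.split()}


idx = build_index(["lecture01.mp3"])
print(search(idx, "audio learning"))
# {'audio': [('lecture01.mp3', 12.9)], 'learning': [('lecture01.mp3', 12.4)]}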

Exalead had an interesting system seven or eight years ago. I am not sure what happened to that demonstration. My recollection is that the challenge is to have sufficient processing power to handle the volume of audio and video content available for indexing.

When an outfit like Google is not able to pull off a comprehensive search system for its audio and video content, my hunch is that the task of handling a robust volume of content might be a challenge.

But if there is sufficient money, engineering talent, and processing power, perhaps I will no longer have to watch serial videos and listen to lousy audio to figure out what some folks are trying to communicate in their presentations.

Stephen E Arnold, March 23, 2016

Interview with Stephen E Arnold Reveals Insights about Content Processing

March 22, 2016

Nikola Danaylov of the Singularity Weblog interviewed technology and financial analyst Stephen E. Arnold on the latest episode of his podcast, Singularity 1 on 1. The interview, Stephen E. Arnold on Search Engines and Intelligence Gathering, offers thought-provoking ideas on important topics related to sectors such as intelligence, enterprise search, and finance, which use the indexing and content processing methods Arnold has worked with for over 50 years.

Arnold attributes the origins of his interest in technology to a programming challenge he sought and accepted from a computer science professor, outside of the realm of his college major of English. His focus on creating actionable software and his affinity for problem-solving of any nature led him to leave PhD work for a job with Halliburton Nuclear. His career includes employment at Booz, Allen & Hamilton, the Courier Journal & Louisville Times, and Ziff Communications, before starting ArnoldIT.com strategic information services in 1991. He co-founded and sold a search system to Lycos, Inc., worked with numerous organizations including several intelligence and enforcement organizations such as US Senate Police and General Services Administration, and authored seven books and monographs on search related topics.

With a continued emphasis on search technologies, Arnold began his blog, Beyond Search, in 2008 aiming to provide an independent source of “information about what I think are problems or misstatements related to online search and content processing.” Speaking to the relevance of the blog to his current interest in the intelligence sector of search, he asserts:

“Finding information is the core of the intelligence process. It’s absolutely essential to understand answering questions on point and so someone can do the job and that’s been the theme of Beyond Search.”

As Danaylov notes, the concept of search encompasses several areas where information discovery is key for one audience or another, whether counter-terrorism, commercial, or other purposes. Arnold agrees,

“It’s exactly the same as what the professor wanted to do in 1962. He had a collection of Latin sermons. The only way to find anything was to look at sermons on microfilm. Whether it is cell phone intercepts, geospatial data, processing YouTube videos uploaded from a specific IP address – exactly the same problem and process. The difficulty that exists is that today we need to process data in a range of file types and at much higher speeds than ever anticipated, but the processes remain the same.”

Arnold explains the iterative nature of his work:

“The proof of the value of the legacy is I don’t really do anything new, I just keep following these themes. The Dark Web Notebook is very logical. This is a new content domain. And if you’re an intelligence or information professional, you want to know, how do you make headway in that space.”

Describing his most recent book, Dark Web Notebook, Arnold calls it “a cookbook for an investigator to access information on the Dark Web.” This monograph includes profiles of little-known firms which perform high-value Dark Web indexing and follows a book he authored in 2015 called CYBEROSINT: Next Generation Information Access.

Yellowfin: Emulating i2 and Palantir?

March 22, 2016

I read “New BI Platform Focuses on Collaboration, Analytics.” What struck me about this explanation of a new version of Yellowfin is that the company is adding the type of features long considered standard in law enforcement and intelligence. The idea is that visualizations and collaboration are components of a commercial business intelligence solution.

I noted this paragraph:

Other BI vendors have tried to push data preparation and analysis responsibilities onto business users “because it’s easier to adapt what they have to fulfill that goal.” But Yellowfin “isn’t a BI tool attempting to make the business user a techie. It is about presenting data to users in an attractive visual representation, backed-up with some of the most sophisticated collaboration tools embedded into a BI platform on the market.”

Analyst involvement in the loading of data is a way to eliminate the issue of content ownership, indexing, and knowledge of what is in the system’s repository. I am not confident that any system which allows the user to whack away at whatever data have been processed by the system is ready for prime time. Sure, Google can win at Go, but the self driving auto ran into a bus.

The write up, which strikes me as New Age public relations, seems to want me to remember what’s new with Yellowfin via this mnemonic: Curated. Baffled? Here’s what curated means:

  • Consistent: Governed, centralized and managed
  • Usable: by any business to consume analytics
  • Relevant: connected to all the data users need to do their jobs well
  • Accurate: data quality is paramount
  • Timely: Provide real time data and agile content development
  • Engaging: Offer a social or collaborative component
  • Deployed: widely across the organization.

Business intelligence is the new “enterprise search.” I am not sure the use of notions like curated and adding useful functions delivers the impact that some marketers promise. Remember that self driving car. Pesky humans.

Stephen E Arnold, March 23, 2016

Search: Gone and Replaced. A Research Delight

March 17, 2016

The notion of indexing “all the world’s information” is an interesting one. I am amused by the assumption some folks make that Bing, Google, and Yandex index “every” Web site and “all” content.

I read “China Has Unblocked Internet Searches That Refer to Kim Jong Un As a ‘Pig’.” The article is a reminder that finding information can be a very difficult business.

According to the write up from an outfit rumored to be interested in some of the Yahooligans’ online business, I learned:

China appears to have made an exception within its extremely restricted Internet this week, for an unusual search term — a reference to North Korean dictator Kim Jong Un as a “third-generation pig.”

What other items are back online? Heck, what books are available in digital form in any country? I do find the animal reference interesting, however. I am baffled by the concept of third-generation.

When you run a query, do you get access to “all” information, or is the entire digital information access environment subject to filtering? Maybe third generation filtering?

Stephen E Arnold, March 17, 2016

Enterprise Search Revisionism: Can One Change What Happened

March 9, 2016

I read “The Search Continues: A History of Search’s Unsatisfactory Progress.” I noted some points which, in my opinion, underscore why enterprise search has been problematic and why the menagerie of experts and marketers have put search and retrieval on the path to enterprise irrelevance. The word that came to mind when I read the article was “revisionism” for the millennials among us.

The write up ignores the fact that enterprise search dates back to the early 1970s. One can argue that IBM’s Storage and Information Retrieval System (STAIRS) was the first significant enterprise search system. The point is that enterprise search as a productized service has a more than 40 year history of over promising and under delivering.

Enterprise search with a touch of Stalinist revisionism.

Customers said they wanted to “find” information. What those individuals meant was have access to information that provided the relevant facts, documents, and data needed to deal with a problem.

Because providing on point information was and remains a very, very difficult problem, the vendors interpreted “find” to mean a list of indexed documents that contained the users’ search terms. But there was a problem. Users were not skilled in crafting queries, which were essentially computer instructions (think AND, OR, and NOT operators) placed between the words the index actually contained.

After STAIRS came other systems, many other systems, which have been documented reasonably well in Bourne and Bellardo-Hahn’s A History of Online Information Services 1963-1976. (The period prior to 1970 covers for-fee, research-centric online systems. STAIRS was among the best known early enterprise information retrieval systems.) I provided some history in the first three editions of the Enterprise Search Report, published from 2003 to 2007. I have continued to document enterprise search in the Xenky profiles and in this blog.

The history makes painful reading for those who invested in many search and retrieval companies and for the executives who experienced the crushing of their dreams and sometimes careers under the buzz saw of reality.

In a nutshell, enterprise search vendors heard what prospects, workers overwhelmed with digital and print information, and unhappy users of those early systems were saying.

The disconnect was that enterprise search vendors parroted back marketing pitches that assured enterprise procurement teams of these functions:

  • Easy to use
  • “All” information instantly available
  • Answers to business questions
  • Faster decision making
  • Access to the organization’s knowledge.

The result was a steady stream of enterprise search product launches. Some of these were funded by US government money, like Verity. Sure, the company struggled with the cost of the infrastructure the Verity system required. The work arounds were okay as long as the infrastructure could keep pace with new and changed word-centric documents. Tossing in other types of digital information, making the system index ever faster, and keeping the Verity system responding quickly was another kettle of fish.

Research oriented information retrieval experts looked at the Verity type system and concluded, “We can do more. We can use better algorithms. We can use smart software to eliminate some of the costs and indexing delays. We can [ fill in the blank ].”

The cycle of describing what an enterprise search system could actually deliver was disconnected from the promises the vendors made. As one moves through the decades from 1973 to the present, the failures of search vendors made it clear that:

  1. Companies and government agencies would buy a system, discover it did not do the job users needed, and buy another system.
  2. New search vendors picked up the methods taught at Cornell, Stanford, and other search-centric research centers and wrapped on additional functions like semantics. The core of most modern enterprise search systems is unchanged from what STAIRS implemented.
  3. Search vendors like Convera came, failed, and went away. Some hit revenue ceilings and sold to larger companies looking for a search utility. The acquisitions hit a high water mark with the sale of Autonomy (a 1990s system) to HP for $11 billion.

What about Oracle as a representative outfit? The Oracle database has included search as a core system function since the day Larry Ellison envisioned becoming a big dog in enterprise software. The search language was Oracle’s version of the structured query language. But people found that difficult to use. Oracle purchased Artificial Linguistics in order to make finding information more intuitive. Oracle continued to try to crack the problem of finding information through the acquisitions of Triple Hop, its in-house Secure Enterprise Search, and some other odds and ends until it bought in rapid succession InQuira (a company formed from the failure of two search vendors), RightNow (technology from a Dutch outfit RightNow acquired), and Endeca. Where is search at Oracle today? Essentially search is a utility and it is available in Oracle applications: customer support, ecommerce, and business intelligence. In short, search has shifted from the “solution” to a component used to get started with an application that allows the user to find the answer to business questions.

I mention the Oracle story because it illustrates the consistent pattern of companies which are actually trying to deliver information that the user of a search system needs to answer a business or technical question.

I don’t want to highlight the inaccuracies of “The Search Continues.” Instead I want to point out the problem buzzwords create when trying to understand why search has consistently been a problem and why today’s most promising solutions may relegate search to a permanent role of necessary evil.

In the write up, the notions of answering questions, analytics, federation (that is, running a single query across multiple collections of content and file types), the cloud, and system performance form the conclusion.

Wrong.

The use of open source search systems means that good enough is the foundation of many modern systems. Palantir-type outfits, essentially enterprise search vendors describing themselves as providers of “intelligence” systems, use open source technology in order to reduce costs and shift bug chasing to a community. The good enough core is wrapped with subsystems that deal with the pesky problems of video, audio, and data streams from sensors or similar sources. Attivio, formed by professionals who worked at the infamous Fast Search & Transfer company, delivers active intelligence but uses open source to handle the STAIRS-type functions. These companies have figured out that open source search is a good foundation. Available resources can be invested in visualizations, generating reports instead of results lists, and graphical interfaces which involve the user in performing tasks smart software at this time cannot perform.

For a low cost enterprise search system, one can download Lucene, Solr, SphinxSearch, or any one of a number of open source systems. There are low cost appliances (keep in mind that the costs of search can be tricky to nail down) from vendors like Maxxcat and Thunderstone. One can make do with the craziness of the search included with Microsoft SharePoint.
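To make “good enough” concrete, here is a minimal sketch of loading and querying documents in a stock Solr instance over HTTP using Python’s requests library. The host, core name, fields, and documents are placeholders, and field handling depends on the core’s schema; treat it as a sketch, not a deployment recipe.

import requests

# Placeholder Solr instance and core; adjust for your installation.
SOLR = "http://localhost:8983/solr/documents"

def add_documents(docs):
    """Post documents to Solr's JSON update handler and commit."""
    resp = requests.post(
        f"{SOLR}/update?commit=true",
        json=docs,
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()


def query(text):
    """Run a simple keyword query and return matching documents."""
    resp = requests.get(f"{SOLR}/select", params={"q": text, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]["docs"]


add_documents([
    {"id": "1", "title": "Enterprise search history", "body": "STAIRS and Verity"},
    {"id": "2", "title": "Open source options", "body": "Lucene, Solr, Sphinx"},
])
print(query("body:verity"))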

For a serious application, enterprises have many choices. Some of these are highly specialized like BAE NetReveal and Palantir Metropolitan. Others are more generic like the Elastic offering. Some are free like the Effective File Search system.

The point is that enterprise search is not what users wanted in the 1970s when IBM pitched the mainframe centric STAIRS system, in the 1980s when Verity pitched its system, in the 1990s when Excalibur (later Convera) sold its system, in the 2000s when Fast Search shifted from Web search to enterprise search and put the company on the road to improper financial behavior, or in the efflorescence of search sell offs (Dassault bought Exalead, IBM bought iPhrase and other search vendors, and Lexmark bought Brainware and ISYS Search Software).

Where are we today?

Users still want on point information. The solutions on offer today are application and use case centric, not the silly one-size-fits-all approach of the period from 2001 to 2011 when Autonomy sold to HP.

Open source search has helped create an opportunity for vendors to deliver information access in interesting ways. There are cloud solutions. There are open source solutions. There are small company solutions. There are more ways to find information than at any other time in the history of search as I know it.

Unfortunately, the same problems remain. These are:

  1. As the volume of digital information goes up, so does the cost of indexing and accessing the sources in the corpus
  2. Multimedia remains a significant challenge for which there is no particularly good solution
  3. Federation of content requires considerable investment in data grooming and normalizing
  4. Multi-lingual corpuses require humans to deal with certain synonyms and entity names
  5. Graphical interfaces still are stupid and need more intelligence behind the icons and links
  6. Visualizations have to be “accurate” because a bad decision can have significant real world consequences
  7. Intelligent systems are creeping forward, but crazy Watson-like marketing raises expectations and undermines the credibility of enterprise search’s capabilities.

I am okay with history. I am not okay with analyses that ignore some very real and painful lessons. I sure would like some of the experts today to know a bit more about the facts behind the implosions of Convera, Delphis, Entopia, and many other companies.

I also would like investors in search start ups to know a bit more about the risks associated with search and content processing.

In short, for a history of search, one needs more than 900 words mixing up what happened with what is.

Stephen E Arnold, March 9, 2016
