
X1 Search: A Unified Single Pane of Glass

May 26, 2015

I read “X1’s Microsoft Enterprise Search Strategy: Better Than Microsoft’s?”

Here’s the passage I noted:

Providing one single pane of glass to a business worker’s most critical information assets is key. Requiring end-users to search Outlook for email in one interface, then log into another to search SharePoint, and then another to search for document and OneDrive is a non-starter. A single interface to search for information, no matter where it lives fits the workflow that business workers require.

The write up points out that X1 starts with an “end user’s email and files.” That’s fine, but there are other data types to which an end user requires access.

My reaction was these questions and the answers thereto:

  • What about video?
  • What about drafts of financial data or patent applications and other content centric documents in perpetual draft form?
  • What about images?
  • What about third party content downloaded by a user to a local or shared drive?
  • What about Excel files used as text documents and Excel documents with data and generic column names?
  • What about versions?
  • What about time and date flags versus the time and date information within a content object?
  • What about SMS messages?
  • What information is related to other information; for example, an offer of employment to a former employee?
  • What about employee health, salary, and performance information?
  • What about intercepted data from watched insiders using NGIA tools?
  • What about geo-plotted results based on inputs from the organization’s tracking devices on delivery vans and similar geo systems?

My point is that SharePoint represents a huge market to search and content processing vendors. The generalizations about what a third party system can do boggle my mind. Vendors as a rule do not focus on the content issues my questions probe. There are good reasons for the emphasis on email and experiences. Tackling substantive findability issues within an organization is just not what most SharePoint search alternatives do.

Not surprisingly, for certain types of use cases, SharePoint search remains a bit of a challenge regardless of what system is deployed into a somewhat chaotic sea of code, functions, and components.

A unified single pane of glass is redundant. Solutions to the challenges of SharePoint may deserve this type of remediation because vendors have been tilting at the SharePoint windmill in a highly repetitive way for more than a decade. And to what end? For many, SharePoint information access remains opaque, cloudy, and dark.

Stephen E Arnold, May 26, 2015

Welcome YottaSearch

May 26, 2015

There is another player in the enterprise search game: Yotta Data Technologies has announced its newest product in “Yotta Data Technologies Announces Enterprise Search And Big Data Analytics Platform.” Yotta Data Technologies is known for its affordable and easy to use information management solutions. Yotta has expanded its line up by creating YottaSearch, a data analytics and search platform designed to be a data hub for organizations.

“YottaSearch brings together the most powerful and agile open source technologies available to enable today’s demanding users to easily collect data, search it, analyze it and create rich visualizations in real time.  From social media and email for Information Governance and eDiscovery to web and network server logs for Information Technology Operations Analytics (ITOA), YottaSearch™ provides the Big Data Analytics for users to derive information intelligence that may be critical to a project, case, business unit or market.”

YottaSearch uses the popular SaaS model and offers users not only data analytics and search, but also knowledge management, information governance, eDiscovery, and IT operations analytics.  Yotta decided to create YottaSearch to earn revenue from the burgeoning big data market, especially the enterprise search end.

The market is worth $1.7 billion, so Yotta has a lot of competition, but if they offer something different and better than their rivals they stand a chance to rise to the top.

Whitney Grace, May 26, 2015
Sponsored by, publisher of the CyberOSINT monograph

Search 2020: Peering into the Future of Information Access

May 22, 2015

The shifts in search, user behaviors, and marketing are transforming bread-and-butter keyword search. Quite to my surprise, one of my two or three readers wrote to one of the goslings with a request. In a nutshell, the reader wanted my view of a write up which appeared in the TDWI online publication. TDWI, according to the Web site, is “your source for in depth education and research on all things data.” Okay, I can relate to a categorical affirmative, education, research, and data.

The article has a title which tickles my poobah bone: “The Future of Search.” The poobah bone is the part of the anatomy which emits signals about the future. I look at a new search system based on Lucene and other open source technology. My poobah bone tingles. Lots of folks have poobah bones, but these constructs of nerves and tissues are most highly developed in entrepreneurs who invent new ways to locate information, venture capitalists who seek the next Google, and managers who are hired to convert information access into billions and billions of dollars in organic revenue.

The write up identifies three predictions about drivers on the information retrieval utility access road:

  1. Big Data
  2. Cloud infrastructure
  3. Analytics.

Nothing unfamiliar in these three items. Each shares a common characteristic: None has a definition which can be explained in a clear concise way. These are the coat hooks in the search marketers’ cloakroom. Arguments and sales pitches are placed on these hooks because each connotes a “new” way to perform certain enterprise computer processes.

But what about these drivers: Mobile access, just-in-time temporary/contract workers, short attention spans of many “workers”, video, images, and real time information requirements? Perhaps these are subsets of the Big Data, cloud, and analytics generalities, but maybe, just maybe, could these realities be depleted uranium warheads when it comes to information access?

These are the present. What is the future? Here’s a passage I highlighted:

Enterprise search in 2020 will work much differently than it does today. Apple’s Siri, IBM’s Watson, and Microsoft’s Cortana have shown the world how enterprise search and text analytics can combine to serve as a personal assistant. Enterprise search will continue to evolve from being your personal assistant to being your personal advisor.

How are these systems actually working in noisy automobiles or in the kitchen?

I know that the vendors I profiled in CyberOSINT: Next Generation Information Access are installing systems which perform this type of content processing. The problem, as I point out in CyberOSINT, is that search is, at best, a utility. The heavy lifting comes from collection, automated content processing, and various output options. One of the most promising is to deliver specific types of outputs to both humans and to other systems.

The future does tailor information to a person or to a unit. Organizations are composed of teams of teams, a concept now getting a bit more attention. The idea is not a new one. What is important is that next generation information access systems operate in a more nuanced manner than a list of results from a Lucene based search query.

The article veers into an interesting high school teacher type application of Microsoft’s spelling and grammar checker. The article suggests that the future of search will be to alert the system user that his or her “tone” is inappropriate. Well, maybe. I turn off these inputs from software.

The future of search involves privacy issues which have to be “worked out.” No, privacy issues have been worked out via comprehensive, automated collection. The issue is how quickly organizations will make use of the features automated collection and real time processing deliver. Want to eliminate the risk of insider trading? Want to identify bad actors in an organization? One can, but this is not a search function. This is an NGIA function.

The write up touches on a few of the dozens of issues implicit in the emergence of next generation information access systems. But NGIA is not search. NGIA systems are a logical consequence of the failures of enterprise search. These failures are not addressed with generalizations. NGIA systems, while not perfect, move beyond the failures, disappointments, and constant legal hassles search vendors have created in the last 40 years.

My question, “What is taking so long?”

Stephen E Arnold, May 22, 2015

Yotta Search: A Full Service Solution

May 17, 2015

I spoke to a colleague who asked me about Yotta Search. I dug through my Overflight files and located a write up about the new enterprise search system from Yotta Data Technologies and a company called Yotta Customer Analytics. One Yotta is in Cleveland. The other is in Silicon Valley. Both are in the analytics game.

A “yotta” is a whole lotta data, the biggest unit of data. I wonder if the company has a comment on a set of yottas?

I checked my files for the company offering Yotta search, based in Cleveland, home of EPI Thunderstone, another enterprise search vendor. The company behind Yotta Search is Yotta Data Technologies.

According to the firm’s Web site:

Yotta Data Technologies (YDT) is a technology company built on a foundation of deep industry experience and driven by a passion for innovative excellence. We provide data management and  information governance solutions to corporations, firms and agencies, whether they be a small local firm or a multinational corporation with offices around the globe.  Each of our platforms maintains the high levels of quality, performance and security that are critical within information governance initiatives and any data management project.

The search system appears to be based on open source technology if I understand this Web site information:

Yotta Search is a versatile enterprise search solution being developed by Yotta Data Technologies (YDT) for teams, small to medium sized businesses and large corporations. Yotta Search provides powerful, fast and flexible technology that is not only well beyond full text search, but also powers the search and analysis features of many of the world’s largest internet sites and data platforms.

The operative phrase is “being developed.” The company asserts capabilities in these functions:

  • Business intelligence
  • Discovery
  • Information governance
  • Virtual data rooms.

I noticed a news item called “Yotta Data Technologies Announces Enterprise Search and Big Data Analytics Platform.” If the information is correct, Yotta is no longer “being developed”; one can license the system. The story describes the Yotta search system in this way:

YottaSearch is easy – and budget friendly – to implement with a cloud-based, Software-as-a-Service (SaaS) delivery model and a disruptive, subscription-based pricing model.

Key Functionality of the YottaSearch

  • Data Point Connectors – Local, Network, Email, Enterprise Systems, Databases, Social Media
  • File Crawlers – Detects & Parses over 1,000 file types
  • File Indexer – Language Detection, Deduplication, Near Real Time, Distributed, Scalable
  • Advanced Search Engines – Based on the high performance Apache Lucene library
  • Data Analytics – Intelligent analysis of structured and unstructured data
  • Dynamic Dashboards – Explore, analyze, navigate and define large volumes of complex data.

The system can be used for a number of applications, according to the write up:

  • Enterprise Search and Analytics
  • Information Governance
  • IT Operations Analytics (ITOA)
  • Investigations & eDiscovery
  • Knowledge Management (KM)
  • Internet of Things (IoT), Event & Log Data Analysis

Also, Yotta offers global data services and global electronic discovery services. The company’s tag line is “Information intelligence for corporations, firms, and agencies.”

Like I said, a lotta yottas and a robust line up of functionality which some more established search and content processing systems do not possess. Is Yotta competing with Elastic or is Yotta competing with the ABC vendors: Attivio, BA Insight, or Coveo? Worth watching.

Stephen E Arnold, May 17, 2015

Quote to Note: How to Make Search Relevant

May 16, 2015

Short honk: I read “Intranet Search? Sssh! Don’t Speak of It.” It seems that enterprise search is struggling and sweeping generalizations about information governance and knowledge management are not helping the situation. But that’s just my opinion.

But set that “issue” aside. Here’s the quote I noted:

The only way this situation [search is a problem’] will change is with intranet managers stepping up to the challenge and telling stories internally. The problem with search analytics (even if you do everything that Lou Rosenfeld [search wizard] recommends) is that there is no direct evidence of the day-to-day impact of search.

Will accountants respond to search stories? Why is there no direct evidence of the day to day impact of search? Perhaps search, along with some other hoo hah endeavors, is simply not relevant in today’s business environment? Won’t more hyperbole filled marketing solve the problem? Another conference?

The wet blanket on enterprise search remains “there is no direct evidence of the day to day impact of search.” After 30 or 40 years of implementations and hundreds of millions in search development, why not? Er, what about this thought:

Search is a low value utility which has been over hyped.

Stephen E Arnold, May 17, 2015

Elasticsearch Transparent about Failed Jepsen Tests

May 11, 2015

The article on Aphyr titled Call Me Maybe: Elasticsearch 1.5.0 demonstrates the ongoing tendency for Elasticsearch to lose data during network partitions. The author goes through several scenarios and finds that users can lose documents if nodes crash, a primary pauses, or a network partitions into two intersecting components or into two discrete components. The article explains,

“My recommendations for Elasticsearch users are unchanged: store your data in a database with better safety guarantees, and continuously upsert every document from that database into Elasticsearch. If your search engine is missing a few documents for a day, it’s not a big deal; they’ll be reinserted on the next run and appear in subsequent searches. Not using Elasticsearch as a system of record also insulates you from having to worry about ES downtime during elections.”
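The recommendation in that passage can be sketched in a few lines. This is a minimal illustration, not Aphyr’s code and not the Elasticsearch client: the `InMemoryIndex` class, the `documents` table, and the `sync_all` function are hypothetical stand-ins, with SQLite playing the part of the database with better safety guarantees. The point is only that a full, repeatable upsert pass puts back whatever the index dropped.

```python
import sqlite3


class InMemoryIndex:
    """Hypothetical stand-in for a search index (not the Elasticsearch client)."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, body):
        # Insert or overwrite, so repeating the same call is harmless.
        self.docs[doc_id] = body


def sync_all(conn, index):
    """Upsert every row from the system of record into the index."""
    count = 0
    for doc_id, body in conn.execute("SELECT id, body FROM documents"):
        index.upsert(doc_id, body)
        count += 1
    return count


# The database is the system of record (table name is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(1, "quarterly report"), (2, "patent draft")],
)
conn.commit()

index = InMemoryIndex()
sync_all(conn, index)

# Simulate the index losing a document, for example during a partition.
del index.docs[2]

# The next scheduled pass reinserts it from the database.
sync_all(conn, index)
```

In a real deployment the pass would run on a schedule and the stand-in class would be replaced by calls to whatever search system is in use; the loss of a document then lasts only until the next run.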

The article praises Elasticsearch for its internal approach to documenting the problems, and especially the page the company opened in September going into detail on resiliency. The page answers users’ questions about what it meant that the ticket was closed. The page states pretty clearly that ES failed its Jepsen tests. The article exhorts other vendors to follow a similar regimen of supplying such information to users.

Chelsea Kerwin, May 11, 2015


Semantic Search: The View from a Taxonomy Consultant

May 9, 2015

My team and I are working on a new project. With our Overflight system, we have an archive of memorable and not so memorable factoids about search and content processing. One of the goslings who was actually working yesterday asked me, “Do you recall this presentation?”

The presentation was “Implementing Semantic Search in the Enterprise,” created in 2009, which works out to six years ago. I did not recall the presentation. But the title evoked an image in my mind like this:


I asked, “How is this germane to our present project?”

The reply the gosling quacked was, “Semantic search means taxonomy.” The gosling enjoined me to examine this impressive looking diagram:



I don’t want a document. I don’t want formatted content. I don’t want unformatted content. I want on point results I can use. To illustrate the gap between dumping a document on my lap and presenting something useful, look at this visualization from Geofeedia:


The idea is that a person can draw a shape on a map, see the real time content flowing via mobile devices, and look at a particular object. There are search tools and other utilities. The user of this Geofeedia technology examines information in a manner that does not produce a document to read. Sure, a user can read a tweet, but the focus is on understanding information, regardless of type, in a particular context in real time. There is a classification system operating in the plumbing of this system, but the key point is the functionality, not the fact that a consulting firm specializing in taxonomies is making a taxonomy the Alpha and the Omega of an information access system.

The deck starts with the premise that semantic search pivots on a taxonomy. The idea is that a “categorization scheme” makes it possible to index a document even though the words in the document may not be the words in the taxonomy.


For me, the slide deck’s argument was off kilter. The mixing up of a term list and semantic search is the evidence of a Rube Goldberg approach to a quite important task: Accessing needed information in a useful, actionable way. Frankly, I think that dumping buzzwords into slide decks creates more confusion when focus and accuracy are essential.

At lunch the goslings and I flipped through the PowerPoint deck which is available via LinkedIn Slideshare. You may have to register to view the PowerPoint deck. I am never clear about what is viewable, what’s downloadable, and what’s on Slideshare. LinkedIn has its real estate, publishing, and personnel businesses to which to attend, so search and retrieval is obviously not a priority. The entire experience was superficially amusing but on a more profound level quite disturbing. No wonder enterprise search implementations careen in a swamp of cost overruns and angry users.

Now creating taxonomies or what I call controlled term lists can be a darned exciting process. If one goes the human route, there are discussions about what term maps to what word or phrase. Think buzz group and discussion group and online collaboration. What terms go with what other terms? In the good old days, these term lists were crafted by subject matter and indexing specialists. For example, the guts of the ABI/INFORM classification coding terms originated in the 1981-1982 period and were the product of more than 14 individuals, one advisor (the now deceased Betty Eddison), and the begrudging assistance of the Courier Journal’s information technology department which performed analyses of the index terms and key words in the ABI/INFORM database. The classification system was reasonably good, and it was licensed by the Royal Bank of Canada, IBM, and some other savvy outfits for their own indexing projects.

As you might know, investing two years in human and some machine inputs was an expensive proposition. It was the initial step in the reindexing of the ABI/INFORM database, which at the time was one of the go to sources of high value business and management information culled from more than 800 publications worldwide.

The only problem I have with the slide deck’s making a taxonomy a key concept is that one cannot craft a taxonomy without knowing what one is indexing. For example, you have a flow of content through and into an organization. In a business engaged in the manufacture of laboratory equipment, there will be a wide range of information. There will be unstructured information like Word documents prepared by wild eyed marketing associates. There will be legal documents artfully copied and pasted together from boiler plate. There will be images of the products themselves. There will be databases containing the names of customers, prospects, suppliers, and consultants. There will be information that employees download from the Internet or tote into the organization on a storage device.

The key concept of a taxonomy has to be anchored in reality, not an external term list like those which used to be provided by Oracle  for certain vertical markets. In short, the time and cost of processing these items of information so that confidentiality is not breached is likely to make the organization’s accountant sit up and take notice.

Today many vendors assert that their systems can intelligently, automatically, and rapidly develop a taxonomy for an organization. I suggest you read the fine print. Even the whizziest taxonomy generator is going to require some baby sitting. To get a sense of what is required, track down an experienced licensee of the Autonomy IDOL system. There is a training period which requires a cohesive corpus of representative source material. Sorry, no images or videos accepted but the existing image and video metadata can be processed. Once the system is trained, then it is run against a test set of content. The results are examined by a human who knows what he or she is doing, and then the system is tuned. After the smart system runs for a few days, the human inspects and calibrates. The idea is that as content flows through the system  and periodic tweaks are made, the system becomes smarter. In reality, indexing drift creeps in. In effect, the smart software never strays too far from the human subject matter experts riding herd on algorithms.
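The “human inspects and calibrates” step can be made concrete with a small check for indexing drift. This is a sketch under my own assumptions, not how Autonomy IDOL or any other product does it: the function names, the Jaccard overlap measure, and the 0.8 threshold are illustrative choices. A subject matter expert tags a sample of documents, and the system’s tags for the same documents are compared against that sample.

```python
def jaccard(a, b):
    """Overlap between two term sets: 1.0 means identical, 0.0 means disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_agreement(machine, human):
    """Average overlap across the documents the human expert has tagged."""
    if not human:
        raise ValueError("the human-tagged sample is empty")
    scores = [
        jaccard(machine.get(doc_id, ()), terms)
        for doc_id, terms in human.items()
    ]
    return sum(scores) / len(scores)


def needs_retuning(machine, human, threshold=0.8):
    """Flag the system for a human tuning pass when agreement drops too low."""
    return mean_agreement(machine, human) < threshold


# Hypothetical sample from a lab equipment manufacturer.
human_sample = {"d1": {"centrifuge", "pricing"}, "d2": {"contract"}}
week_1 = {"d1": {"centrifuge", "pricing"}, "d2": {"contract"}}
week_6 = {"d1": {"centrifuge"}, "d2": {"marketing"}}
```

With these inputs the first week agrees fully with the expert, and by the sixth week the average overlap has fallen to 0.25, which trips the retuning flag. The cost of producing the human sample is the baby sitting referred to above.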

The problem exists even when there is a relatively stable core of technical terminology. The content problem of a lab gear manufacturer is many times greater than that of a company focusing on a specific branch of engineering, science, technology, or medicine. Indexing Halliburton nuclear energy information is trivial when compared to indexing more generalized business content like that found in ABI/INFORM or the typical services organization today.

I agree that a controlled term list is important. One cannot easily resolve entities unless there is a combination of automated processes and look up lists. An example is figuring out if a reference to I.B.M., Big Blue, or Armonk is a reference to the much loved marketers of Watson. Now handle a transliterated name like Anwar al-Awlaki and its variants. This type of indexing is quite important. Get it wrong and one cannot find information germane to a query. When one is investigating aliases used by bad actors, an error can become a bad day for some folks.
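A look up list of the kind described here can be sketched as follows. The alias table is illustrative and tiny, the `normalize` and `resolve` names are my own, and “al-Aulaqi” is included only as one known alternate spelling; a production list would be far larger and maintained by people who know the subject matter.

```python
import re


def normalize(mention):
    """Lowercase, drop periods, turn hyphens into spaces, collapse whitespace."""
    text = mention.lower().replace(".", "").replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical controlled list: canonical entity -> known aliases.
ALIASES = {
    "IBM": ["IBM", "I.B.M.", "Big Blue", "Armonk"],
    "Anwar al-Awlaki": ["Anwar al-Awlaki", "Anwar al-Aulaqi"],
}

# Invert the list so each normalized alias points at its canonical entity.
LOOKUP = {
    normalize(alias): canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def resolve(mention):
    """Return the canonical entity for a mention, or None if it is unknown."""
    return LOOKUP.get(normalize(mention))
```

Calling `resolve("i.b.m")` or `resolve("Big  Blue")` returns `"IBM"`, while an unlisted name returns `None` and has to go to an automated process or a human. Note that mapping “Armonk” to IBM is only safe in some contexts, which is one reason a list alone is not enough.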

The remainder of the slide deck rides the taxonomy pony into the sunset. When one looks at the information created 72 months ago, it is easy for me to understand why enterprise search and content processing has become an “oh, my goodness” problem in many organizations. I think that a mid sized company would grind to a halt if it needed a controlled vocabulary which matched today’s content flows.

My take away from the slide deck is easy to summarize: Putting the cart before the horse won’t get enterprise search where it must go to retain credibility and deliver utility.

Stephen E Arnold, May 9, 2015

Yahoo and Microsoft Announce Search Partnership Reboot

May 7, 2015

It seems that Microsoft and Yahoo are friends again, at least for the time being. Search Engine Watch announces, “Yahoo and Microsoft Amend Search Agreement.” The two companies have been trying to partner on search for the past six years, but it has not always gone smoothly. Writer Emily Alford tells us what will be different this time around:

“First, Yahoo will have greater freedom to explore other search platforms. In the past, Yahoo was rumored to be seeking a partnership with Google, and under the new terms, Microsoft and Yahoo’s partnership will no longer be exclusive for mobile and desktop. Under the new agreement, Yahoo will continue to serve Bing ads on desktop and mobile, as well as use Bing search results for the majority of its desktop search traffic, though the exact number was undisclosed.

“Microsoft and Yahoo are also making changes to the way that ads are served. Microsoft will now maintain control of the Bing ads salesforce, while Yahoo will take full control of its Gemini ads salesforce, which will leave Bing free to serve its own ads side by side with Yahoo search results.”

Yahoo CEO Marissa Mayer painted a hopeful picture in a prepared statement. She and Microsoft CEO Satya Nadella have been working together, she reports, to revamp the search deal. She is “very excited to explore” the fresh possibilities. Will the happy relationship hold up this time around?

Cynthia Murrell, May 7, 2015


EnterpriseJungle Launches SAP-Based Enterprise Search System

May 4, 2015

A new enterprise search system startup is leveraging the SAP HANA Cloud Platform, we learn from “EnterpriseJungle Tames Enterprise Search” at SAP’s News Center. The company states that its goal is to make collaboration easier and more effective with a feature it calls “deep people search.” Writer Susan Galer cites EnterpriseJungle Principal James Sinclair when she tells us:

“Using advanced algorithms to analyze data from internal and external sources, including SAP Jam, SuccessFactors, wikis, and LinkedIn, the applications help companies understand the make-up of its workforce and connect people quickly….

Who Can Help Me is a pre-populated search tool allowing employees to find internal experts by skills, location, project requirements and other criteria which companies can also configure, if needed. The Enterprise Q&A tool lets employees enter any text into the search bar, and find experts internally or outside company walls. Most companies use the prepackaged EnterpriseJungle solutions as is for Human Resources (HR), recruitment, sales and other departments. However, Sinclair said companies can easily modify search queries to meet any organization’s unique needs.”

EnterpriseJungle users manage their company’s data through SAP’s Lumira dashboard. Galer shares Sinclair’s example of one company in Germany, which used EnterpriseJungle to match employees to appropriate new positions when it made a whopping 3,000 jobs obsolete. Though the software is now designed primarily for HR and data-management departments, Sinclair hopes the collaboration tool will permeate the entire enterprise.

Cynthia Murrell, May 4, 2015



Funnelback: Another Enterprise Search Solution Founder Offers Shoulds to Potential Licensees

May 1, 2015

Funnelback, as I have mentioned, has lost some of its marketing oomph. I think some staff shuffling took place. The company is now stepping up its effort to remain visible in a darned tough, crowded, and struggling market sector: Enterprise search.

I read “How Do You Solve a Problem Like Enterprise Search?” My answer is and has been, “One does not. One solves specific information access problems.” The wreckage of Convera, Delphes, Entopia, Fast Search, et al is evidence that enterprise search is a sticky wicket. The howls of pain on the LinkedIn forums and the odd collection of content in the Paper.Li round up about enterprise search make the challenges quite visible.

Read the article. Here are two points I found interesting.

According to the founder of Funnelback:

A holistic enterprise search solution should include:

  • Bird’s-eye view metrics of all content, showing where it’s stored (e.g. web vs. enterprise vs. social media), how much exists in each repository, how old it is, missing metadata, poor quality titles, duplication, accessibility metrics, and the link graph. This provides information managers with a means to prioritize organizational investment in managing information, and thereby enhancing search effectiveness.
  • Intelligent guidance on how to make content more visible/findable. Search engines generally attempt to hide the internals of their ranking systems and this makes it difficult for customers to learn how to make content more findable. An enterprise search engine should use its internal ranking knowledge to show content authors why pages rank the way they do and provide guidance on how to increase each page’s findability.
  • The ability to surface and promote content based on user context with simple rules such as “User is in Department A”, “User is located in New Zealand”, “User is in the finance industry”, “User works for LexisNexis”. These rules can then be overlaid to form more sophisticated rules, without the need to create rules for every distinct possibility. Funnelback goes even further by allowing these rules to be applied to anonymous users by looking up their IP address in an internal database and inferring information based on the organization that owns the IP address.
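The third “should” describes simple context rules that are overlaid rather than enumerated for every combination. A minimal sketch of that idea follows. It is not Funnelback’s API: the rule names, the user context fields, and the promoted content tags are all hypothetical.

```python
# Each rule: (name, predicate over the user context, content tag to promote).
RULES = [
    ("department_a", lambda u: u.get("department") == "A", "dept-a-handbook"),
    ("new_zealand", lambda u: u.get("country") == "New Zealand", "nz-holiday-calendar"),
    ("finance", lambda u: u.get("industry") == "finance", "finance-compliance-guide"),
]


def promoted_content(user):
    """Overlay every matching rule; no rule is written per combination."""
    return [tag for _name, matches, tag in RULES if matches(user)]


# A user who matches two simple rules gets both promotions.
user = {"department": "A", "country": "New Zealand"}
promotions = promoted_content(user)
```

Three rules cover all eight combinations of the three attributes. The sketch leaves out the hard parts: deciding which content a user is permitted to see and resolving conflicts when rules disagree.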

These are darned interesting “shoulds.” The problems of access controls, contractual and regulatory constraints, and the human practice of creating silos of information are tough nuts to crack. “Shoulds” are easy. Delivering is tough, and Funnelback is neither more nor less well equipped than open source or proprietary information retrieval solutions.

The second point illustrates the flawed logic that many champions of enterprise search as a grand solution make. Here’s the passage:

The first question every organization should ask is: Who are the stakeholders affecting the success of our organization and what information do they need to maximize our success?

At a more practical level, this includes questions like:

  • What are the personas in our organization? (i.e. the archetypes that represent the different roles)
  • What information do they need in order to maximize productivity and make better decisions?
  • What are our customer personas?
  • What information do they need in order to maximize engagement and have a positive customer experience?

Without asking these questions, organizations sometimes assume that searching everything with a single query (access controls permitting) is the answer. Sometimes it is the answer, but it can be a more complicated and costly exercise than necessary. For example, do users want to use an enterprise search tool to search their own email, or would they prefer to use the search on their mail client?

Sorry, Funnelback. Asking the questions is the first step. The work is to answer the questions and then use that information to tailor a solution that does not anger the users, lead to litigation, or simply fail to work.

Today’s flagship enterprise search vendors seem to include Coveo, dtSearch, Elastic, Funnelback, and a handful of other firms with low profiles. The present crisis in information access has been created by the actions of previous industry leaders in enterprise search.

The fix is to focus on solving a problem for a specific group of users. Lawyers have specialized search tools. Chemists have specialized search tools. Regular employees have Google and whatever findability solution is available within specific applications.

Want to get in a pickle? Sell a clueless senior executive a solution that solves the information access challenges for the entire organization. Didn’t work for STAIRS and won’t work for today’s systems.

The history of search is a painful one. There are options, but these are next generation systems, not yesterday’s systems wrapped with shoulds.

Stephen E Arnold, May 1, 2015
