Enterprise Search: Roasting Chestnuts in the Cloud
March 6, 2015
I read “Seeking Relevancy for Enterprise Search.” I enjoy articles about “relevance.” The word is ambiguous and must be disambiguated. Yep, that’s one of those functions that search vendors love to talk about and rarely deliver.
The point of the write up is that enterprise content should reside in the cloud. The search system can then process the information, build an index, and deliver a service that allows a single search to output a mix of hits.
Sounds good.
My concern is that I am not sure that’s what users want. The reason for my skepticism is that the shift to the cloud does not fix the broken parts of information retrieval. The user, probably an employee or consultant authorized to access the search system, has to guess which keywords unlock the information in the index.
Search vendors continue to roast the chestnuts of results lists, keyword search, and work arounds for performance bottlenecks. The time is right to move from selling chestnuts to those eager to recapture a childhood flavor and move to a more efficient information delivery system. Image source: http://www.mybalkan.org/weather.html
That’s sort of a problem for many searchers today. In many organizations, users express frustration with search because multiple queries are needed to find information that seems relevant. Then the mind numbing, time consuming drudgery begins. The employee opens a hit, scans the document, copies the relevant bit if it is noted in the first place, and pastes the item into a Word file or a OneNote type app and repeats the process. Most users look at the first page of results, pick the most likely suspect, and use that information.
No, you say.
I suggest you conduct the type of research my team and I have been doing for many years. Expert searchers are a rare species. Today’s employees perceive themselves as really busy, able to make decisions with “on hand” information, and believe themselves to be super smart. Armed with this orientation, whatever these folks do is, by definition, pretty darned good.
It is not. Just don’t try telling a 28 year old that she is not a good searcher and is making decisions without checking facts and assessing the data indexed by a system.
What’s the alternative?
My most recent research points to a relatively new branch or tendril of information access. I use the term “cyberosint” to embrace systems that automatically collect, analyze, and output information to users. Originally these systems focused on public content like Facebook, Twitter posts, and Web content. Now the systems are moving inside the firewall.
The result is that the employee interacts with reports generated with information presented in the form of answers, a map with dynamic data showing where certain events are now taking place, and in streams of data that go into other systems such as a programmatic trading application on Wall Street.
Yes, keyword search is available to these systems which can be deployed on premises, in the cloud, or in a hybrid deployment. The main point is that the chokehold of keyword search is broken using smart software, automatic personalization, and reports.
Keyword search is not an enterprise application. Vendors describe the utility function as the ringmaster of the content circus. Traditional enterprise search is like a flimsy card table upon which has been stacked a rich banquet of features and functions.
The card table cannot support the load. The next generation information access systems, while not perfect, represent a significant shift in information access. To learn more, check out my new study, CyberOSINT.
Roasting chestnuts in the cloud delivers the same traditional chestnut. That’s the problem. Users want more. Maybe a free range, organic gourmet burger?
Stephen E Arnold, March 6, 2015
Silobreaker Forms Cyber Partnership with Norwich University
March 4, 2015
I learned that cyber OSINT capable Silobreaker has partnered with Norwich University. Norwich, the oldest private military college in the US, has a sterling reputation for cyber security courses and degree programs. The Silobreaker online threat intelligence product will be used in the institution’s cyber forensics classes.
Silobreaker’s cyber security product automatically collects open source information from news, blogs, feeds and social media. The system provides easy to use tools and visualizations to make sense of the content.
Kristofer Månsson, CEO and Co-Founder of Silobreaker told Beyond Search:
By offering Silobreaker as part of their studies, Norwich University is addressing the need for a more holistic approach to threat intelligence in cyber security. This partnership showcases the power of Silobreaker to provide relevant context beyond the technical parameters of a threat, hack or a new malware. Understanding the threat landscape and anticipating potential risks will unquestionably also require the analysis of geopolitics, business and world events, which often influence and prompt attacks. We are excited to continue working with Norwich University and to open up the young minds of tomorrow to the ever-evolving cyber landscape.
Silobreaker is used by more than 80 Norwich students. The university offers the product across its cyber security classes including Cyber Criminalistics, Cyber Investigation and Network Forensics. Students learn how to apply Silobreaker’s next generation system to intelligence gathering in the context of their investigations. Students are required to use the technology throughout their independent research projects.
Aron Temkin, dean of the College of Professional Schools said:
In order to maintain our excellence in cyber security research and training, we need to stay on top of the latest emerging technologies. Silobreaker is a powerful tool that is both user-friendly and flexible enough to fit within our cyber education programs.
Dr. Peter Stephenson, director of the university’s Center for Advanced Computing and Digital Forensics, added:
Students can get useful output quickly, and we do not have to turn a semester forensics class into a ‘How To Use Silobreaker’ session. Cyber events do not occur in a vacuum. There is context around them that often is hard to see. Silobreaker solves that. It cuts through the mass of information available on the Internet and helps our students get to the meat of an issue quickly and with a variety of ways of accessing and displaying it. This is a new way to look at cyber forensics.
Silobreaker is a data analytics company specializing in cyber security and risk intelligence. The company’s products help intelligence professionals to make sense of the overwhelming amount of data available today on the web. Silobreaker collects large volumes of open source data from news, blogs, feeds and social media and provides the tools and visualizations for analyzing and contextualizing such data. Customers save time by working more efficiently through big data-sets and improve their expertise and knowledge from examining and interpreting the data more easily. For more information, navigate to www.silobreaker.com.
Interviews with Silobreaker’s Mat Bjore are available via the free Search Wizards Speak service.
Taxonomy Turmoil: Good Enough May Be Too Much
February 28, 2015
For years, I have posted a public indexing Overflight. You can examine the selected outputs at this Overflight link. (My non public system is more robust, but the public service is a useful temperature gauge for a slice of the content processing sector.)
When it comes to indexing, most vendors provide keyword, concept tagging, and entity extraction. But are these tags spot on? No, most are good enough.
A happy quack to Jackson Taylor for this “good enough” cartoon. The salesman makes it clear that good enough is indeed good enough in today’s marketing enabled world.
I chose about 50 companies that asserted their systems performed some type of indexing or taxonomy function. I learned that the taxonomy business is “about to explode.” I find that to be either an interesting investment tip or a statement that is characteristic of content processing optimists.
Like search and retrieval, plugging in “concepts” or other index terms is a utility function. For example, if one indexes each word in an article appearing in this blog, the article might be about another subject. For example, in this post, I am talking about Overflight, but the real topic is the broader use of metadata in information retrieval systems. I could assign the term “faceted navigation” to this article as a way to mark this article as germane to point and click navigation systems.
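The gap between indexing each word and assigning a concept term can be shown in a few lines. What follows is a minimal sketch, not any vendor's method: the document id, the sample text, the tag set, and the function names are all hypothetical and chosen only to mirror the Overflight example above.

```python
import re
from collections import defaultdict

# Hypothetical one-document corpus and hand-assigned concept tags.
docs = {"overflight-post": "I am talking about Overflight and the companies it monitors."}
concept_tags = {"overflight-post": {"faceted navigation", "metadata"}}

def build_keyword_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def keyword_hits(index, term):
    """Documents whose text literally contains the term."""
    return index.get(term.lower(), set())

def concept_hits(tags, term):
    """Documents to which an indexer assigned the term, whether or not the word appears."""
    return {doc_id for doc_id, assigned in tags.items() if term.lower() in assigned}

index = build_keyword_index(docs)
```

A query for "overflight" finds the post through the word index. A query for "faceted navigation" or "metadata" finds it only through the assigned terms, which is the point of adding index terms in the first place.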
If you examine the “reports” Overflight outputs for each of the companies, you will discover several interesting things as I did on February 28, 2015 when I assembled this short article.
- Mergers or purchases of failed vendors at fire sale prices are taking place. Examples include Lucidea’s purchase of Cuadra and InMagic. Both of these firms are anchored in traditional indexing methods and seemed to be within a revenue envelope until their sell out. Business Objects acquired Inxight and then SAP acquired Business Objects. Bouvet acquired Ontopia. Teradata acquired Revelytix.
- Moving indexing into open source. Thomson Reuters acquired ClearForest and made most of the technology available as OpenCalais. OpenText, a rollup outfit, acquired Nstein. SAS acquired Teragram. Smartlogic acquired Schemalogic. (A free report about Schemalogic is available at www.xenky.com/vendor-profiles.)
- A number of companies just failed, shut down, or went quiet. These include Active Classification, Arikus, Arity, Forth ICA, MaxThink, Millennium Engineering, Navigo, Progris, Protege, punkt.net, Questans, Quiver, Reuse Company, Sandpiper,
- The indexing sector includes a number of companies my non public system monitors; for example, the little known Data Harmony with six figure revenues after decades of selling really hard to traditional publishers. Conclusion: Indexing is a tough business to keep afloat.
There are numerous vendors who assert their systems perform indexing, entity, and metadata extraction. More than 18 of these companies are profiled in CyberOSINT, my new monograph. Oracle owns Triple Hop, RightNow, and Endeca. Each of these acquired companies performs indexing and metadata operations. Even the mashed potatoes search solution from Microsoft includes indexing tools. The proprietary XML data management vendor MarkLogic asserts that it performs indexing operations on content stored in its repository. Conclusion: More cyber oriented firms are likely to capture the juicy deals.
So what’s going on in the world of taxonomies? Several observations strike me as warranted:
First, none of the taxonomy vendors are huge outfits. I suppose one could argue that IBM’s Lucene based system is a billion dollar baby, but that’s marketing peyote, not reality. Perhaps MarkLogic which is struggling toward $100 million in revenue is the largest of this group. But the majority of the companies in the indexing business are small. Think in terms of a few hundred thousand in annual revenue to $10 million with generous accounting assumptions.
What’s clear to me is that indexing, like search, is a utility function. If a good enough search system delivers good enough indexing, then why spend for humans to slog through the content and make human judgments? Why not let Google funded Recorded Future identify entities, assign geo codes, and extract meaningful signals? Why not rely on Haystax or RedOwl or any one of the more agile firms to deliver higher value operations?
I would assert that taxonomies and indexing are important to those who desire the accuracy of a human indexed system. This assumes that the humans are subject matter specialists, the humans are not fatigued, and the humans can keep pace with the flow of changed and new content.
The reality is that companies focused on delivering old school solutions to today’s problems are likely to lose contracts to companies that deliver what the customer perceives as a higher value content processing solution.
What can a taxonomy company do to ignite its engines of growth? Based on the research we performed for CyberOSINT, the future belongs to those who embrace automated collection, analysis, and output methods. Users may, if they so choose, provide guidance to the system. But the days of yore, when monks with varying degrees of accuracy created catalog sheets for the scriptoria, have been washed to the margin of the data stream by today’s content flows.
What’s this mean for the folks who continue to pump money into taxonomy centric companies? Unless the cyber OSINT drum beat is heeded, the failure rate of the Overflight sample is a wake up call.
Buying Apple bonds might be a more prudent financial choice. On the other hand, there is an opportunity for taxonomy executives to become “experts” in content processing.
Stephen E Arnold, February 28, 2015
CyberOSINT Wrap
February 21, 2015
The invitation only seminar seems to have kept attendees from experiencing a spike in blood pressure. There were, in my opinion, three takeaways from the presentations from a dozen organizations providing next generation information access systems. Information about the program may be available online.
First, CyberOSINT is maturing as an enterprise software sector.
Second, use cases permit standard return on investment measures to be used.
Third, the organizations working in this niche are tightly integrated with the intelligence and law enforcement communities and with vendors providing services to sidestep the problems presented by using old style methods such as keyword search.
At some point in the near future, there will be an invitation only webinar about cyber OSINT available. To be considered as an attendee, send an email to benkent2020 at yahoo dot com.
I want to thank the companies presenting. The presentations were of exceptional quality and contained significant information payloads.
Stephen E Arnold, February 21, 2015
Automated Collection Keynote Preview
February 14, 2015
On February 19, 2015, I will do the keynote at an invitation only intelligence conference in Washington, DC. A preview of my formal remarks is available in an eight minute video at this link. The preview has been edited. I have inserted an example of providing access to content not requiring a Web site.
The preview includes a comment about the speed with which information and data change and become available. Humans cannot keep up with external and most internal-to-the-organization information.
The preview also includes a simplified schematic of the principal components of a next generation information access system. The diagram is important because it reveals that keyword search is a supporting utility, not the wonder tool many marketers hawk to unsuspecting customers. The supporting research for the talk and the full day conference appears in CyberOSINT, which is now available as an eBook.
Stephen E Arnold, February 14, 2015
Enterprise Search: Security Remains a Challenge
February 11, 2015
Download an open source enterprise search system or license a proprietary system. Once the system has been installed, the content crawled, the index built, the interfaces set up, and the system optimized, the job is complete, right?
Not quite. Retrofitting a keyword search system to meet today’s security requirements is a complex, time consuming, and expensive task. That’s why “experts” who write about search facets, search as a Big Data system, and search as a business intelligence solution ignore security or reassure their customers that it is no big deal. Security is a big deal, and it is becoming a bigger deal with each passing day.
There are a number of security issues to address. The easiest of these is figuring out how to piggyback on access controls provided by a system like Microsoft SharePoint. Other organizations use different enterprise software. As I said, using access controls already in place and diligently monitored by a skilled security administrator is the easy part.
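The “easy part” can be pictured as trimming a results list against access controls copied into the index at crawl time. This is a minimal sketch under stated assumptions: the Hit class, the group names, and trim_results are hypothetical, and nothing here reflects how SharePoint or any particular search product actually implements the check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    """A search hit carrying the groups allowed to read the source document (hypothetical shape)."""
    doc_id: str
    allowed_groups: frozenset

def trim_results(hits, user_groups):
    """Keep only hits whose access list shares at least one group with the user.

    A hit with an empty access list is dropped, so the default is deny.
    """
    user_groups = frozenset(user_groups)
    return [hit for hit in hits if hit.allowed_groups & user_groups]

# Hypothetical results for a single query.
hits = [
    Hit("budget-2015", frozenset({"finance"})),
    Hit("employee-handbook", frozenset({"all-staff"})),
    Hit("pending-lawsuit", frozenset({"legal"})),
    Hit("orphaned-file", frozenset()),
]

visible = trim_results(hits, {"all-staff", "finance"})
```

The sketch assumes the access list in the index is current. If permissions change in the source system after the crawl, the trimmed list is wrong until the next update, which is one reason the diligent administrator matters.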
A number of sticky wickets remain; for example:
- Some units of the organization may do work for law enforcement or intelligence entities. There may be different requirements. Some are explicit and promulgated by government agencies. Others may be implicit, acknowledged as standard operating procedure by those with the appropriate clearance and the need to know.
- Specific administrative content must be sequestered. Examples range from information assembled for employee health to compliance requirements for pharma products or controlled substances.
- Legal units may require that content be contained in a managed system with administrative controls put in place to ensure that no changes are introduced into a content set, that access is provided only to those with specific credentials, or that the content is kept “off the radar” as the in house legal team tries to figure out how to respond to a discovery activity.
- Some research units may be “black”; that is, no one in the company, including most information technology and security professionals, is supposed to know where an activity is taking place or what information is of interest to the research team. Specialized security steps must be enforced. These can include dongles, air gaps, and unknown locations and staff.
An enterprise search system without NGIA security functions is like a 1960s Chevrolet project car. Buy it ready to rebuild for $4,500 and invest $100,000 or more to make it conform to 2015’s standards. Source: http://car.mitula.us/impala-project
How do enterprise search systems deal with these access issues? Are not most modern systems positioned to index “all” content? Are the procedures for each of these four examples part of the enterprise search systems’ administrative tool kit?
Based on the research I conducted for CyberOSINT: Next Generation Information Access and my other studies of enterprise search, the answer is, “No.”
Cyber Threats Boost Demand for Next Generation System
February 10, 2015
President Obama’s announcement of a new entity to combat the deepening threat from cyber attacks adds an important resource to counter cyber threats.
The decision reflects the need for additional counter terrorism resources in the wake of the Sony and Anthem security breaches. The new initiative serves both Federal and commercial sectors’ concerns with escalating cyber threats.
The Department of Homeland Security said in a public release: “National Cybersecurity and Communications Integration Center mission is to reduce the likelihood and severity of incidents that may significantly compromise the security and resilience of the Nation’s critical information technology and communications networks.”
For the first time, a clear explanation of the software and systems that perform automated collection and analysis of digital information is available. Stephen E. Arnold’s new book, “CyberOSINT: Next Generation Information Access,” was written to provide information about advanced information access technology. The new study was published by Beyond Search on January 21, 2015.
The author is Stephen E Arnold, a former executive at Halliburton Nuclear Services and Booz, Allen & Hamilton. He said: “The increase in cyber threats means that next generation systems will play a rapidly increasing part in law enforcement and intelligence activities.”
The monograph explains why next generation information access systems are the logical step beyond keyword search. Also, the book provides the first overview of the architecture of cyber OSINT systems. The monograph provides profiles of more than 20 systems now available to government entities and commercial organizations. The study includes a summary of the year’s research behind the monograph and a glossary of the terms used in cyber OSINT.
Cyber threats require next generation information access systems due to proliferating digital attacks. According to Chuck Cohen, lieutenant with a major Midwestern law enforcement agency and adjunct instructor at Indiana University, “This book is an important introduction to cyber tools for open source information. Investigators and practitioners needing an overview of the companies defining this new enterprise software sector will want this monograph.”
In February 2015, Arnold will keynote a conference on CyberOSINT held in the Washington, DC area. Attendance at the conference is by invitation only. Those interested in the day long discussion of cyber OSINT can write benkent2020 at yahoo dot com to express their interest in the limited access program.
Arnold added: “Using highly-automated systems, governmental entities and corporations can detect and divert cyber attacks and take steps to prevent assaults and apprehend the people that are planning them. Manual methods such as key word searches are inadequate due to the volume of information to be analyzed and the rapid speed with which threats arise.”
Robert David Steele, a former CIA professional and the co-creator of the Marine Corps intelligence activity, said about the new study: “NGIA systems are integrated solutions that blend software and hardware to address very specific needs. Our intelligence, law enforcement, and security professionals need more than brute force keyword search. This report will help clients save hundreds of thousands of dollars.”
Information about the new monograph is available at www.xenky.com/cyberosint.
Ken Toth, February 10, 2015
Enterprise Search: Mapless and Lost?
February 5, 2015
One of the content challenges traditional enterprise search trips over is geographic functions. When an employee looks for content, the implicit assumption is that keywords will locate a list of documents in which the information may be located. The user then scans the results list—whether in Google style laundry lists or in the graphic display popularized by Grokker and Kartoo which have gone dark. (Quick aside: Both of these outfits reflect the influence of French information retrieval wizards. I think of these as emulators of Datops “balls” displays.)
A results list displayed by the Grokker system. The idea is that the user explores the circular areas. These contain links to content germane to the user’s keyword query.
The Kartoo interface displays sources connected to related sources. Once again the user clicks and goes through the scan, open, read, extract, and analyze process.
In a broad view, both of these visualizations are maps of information. Do today’s users want these types of hard to understand maps?
In CyberOSINT I explore the role of “maps” (or, more properly, geographic intelligence (geoint), geo-tagging, and geographic outputs) from automatically collected and analyzed data.
The idea is that a next generation information access system recognizes geographic data and displays those data in maps. Think in terms of overlays on the eye popping maps available from commercial imagery vendors.
What do these outputs look like? Let me draw one example from the discussion in CyberOSINT about this important approach to enterprise related information. Keep in mind that an NGIA system can process any information made available to it; for example, enterprise accounting systems or databased content along with text documents.
In response to either a task, a routine update when new information becomes available, or a request generated by a user with a mobile device, the output looks like this on a laptop:
Source: ClearTerra, 2014
With the approach that ClearTerra offers, information about customers, prospects, or other types of data which carry geo-codes appears on a dynamic map. The map can be displayed on the user’s device; for example, a mobile phone. In some implementations, the map is a dynamic PDF file which displays locations of items of interest as the items move. Think of a person driving a delivery truck or an RFID tagged package.
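The step from geo-coded records to something a map can display can be sketched with GeoJSON, an open format that many mapping tools accept. This is an illustration only, not ClearTerra's method: the record fields (name, lat, lon), the sample coordinates, and the to_geojson function are assumptions made for the example.

```python
import json

def to_geojson(records):
    """Turn geo-coded records into a GeoJSON FeatureCollection.

    Records without coordinates are skipped because they cannot be placed on a map.
    GeoJSON lists longitude first, then latitude.
    """
    features = []
    for rec in records:
        if rec.get("lat") is None or rec.get("lon") is None:
            continue
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [rec["lon"], rec["lat"]]},
            "properties": {"name": rec["name"]},
        })
    return {"type": "FeatureCollection", "features": features}

# Hypothetical records; the second has no position yet.
records = [
    {"name": "Delivery truck 7", "lat": 38.25, "lon": -85.76},
    {"name": "Package without a fix", "lat": None, "lon": None},
]

collection = to_geojson(records)
map_payload = json.dumps(collection)
```

To follow a moving item such as the delivery truck, the same conversion would be rerun each time a new position arrives and the map layer refreshed with the new payload.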
Twitter Loves Google Again and for Now
February 5, 2015
I have been tracking Twitter search for a while. There are good solutions, but these require some heavy lifting. The public services are hit and miss. Have you poked into the innards of TweetTunnel?
I read “Twitter Strikes Search Deal with Google to Surface Tweets.” Note that this link may require you to pay for access or the link has gone dead. According to the news story:
The deal means the 140-character messages written by Twitter’s 284 million users could be featured faster and more prominently by the search engine. The hope is that greater placement in Google’s search results could drive more traffic to Twitter, which could one day sell advertising to these visitors when they come to the site, or more important, entice them to sign up for the service.
Twitter wants to monetize its content. Google wants to sell ads.
The only hitch in the git along is that individual tweets are often less useful than processing of tweets by a person, a tag, or some other index point. A query for a tweet can be darned misleading. Consider running a query for a tweet on the Twitter search engine. Enter the term “thunderstone”. What do you get? Games. What about the search vendor Thunderstone? Impossible to find, right?
For full utility from Twitter, one may want to license the Twitter stream from an authorized vendor. Then pump the content into a next generation information access system. Useful outputs result for many concepts.
For more about NGIA systems and processing large flows of real time information, see CyberOSINT: Next Generation Information Access. Reading an individual tweet is often less informative than examining subsets of tweets.
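Processing tweets by tag or by person rather than one at a time can be sketched as follows. The sample tweets, the field names (user, text), and both functions are hypothetical; a licensed Twitter stream would have its own schema, and an NGIA system would do far more than count.

```python
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")

def group_by_hashtag(tweets):
    """Collect tweets under each hashtag they carry, ignoring case."""
    groups = defaultdict(list)
    for tweet in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(tweet["text"])}
        for tag in tags:
            groups[tag].append(tweet)
    return groups

def top_authors(tweets, n=3):
    """Return the n most active accounts in a subset of tweets."""
    return Counter(tweet["user"] for tweet in tweets).most_common(n)

# Hypothetical sample; real input would come from a licensed stream.
tweets = [
    {"user": "a", "text": "Thunderstone #search appliance update"},
    {"user": "b", "text": "New #Search vendor report"},
    {"user": "a", "text": "#search and #games are different things"},
    {"user": "c", "text": "Thunderstone #games expansion"},
]

groups = group_by_hashtag(tweets)
```

Here the “thunderstone” tweets fall into separate search and games subsets, so the vendor is no longer buried under the game. Calling top_authors on a subset shows who is driving that topic.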
Stephen E Arnold, February 5, 2015
Enterprise Search: NGIA Vendors Offer Alternative to the Search Box
February 4, 2015
I have been following the “blast from the past” articles that appear on certain content management oriented blogs and news services. I find the articles about federated search, governance, and knowledge related topics oddly out of step with the more forward looking developments in information access.
I am puzzled because the keyword search sector has been stuck in a rut for many years. The innovations touted in the consulting-jargon of some failed webmasters, terminated in house specialists, and frustrated academics are old, hoary with age, and deeply problematic.
There are some facts that cheerleaders for the solutions of the 1970s, 1980s, and 1990s choose to overlook:
- Enterprise search typically means a subset of content required by an employee to perform work in today’s fluid and mobile work environment. The mix of employees and part timers translates to serious access control work. Enterprise search vendors “support” an organization’s security systems in the manner of a consulting physician to heart surgery. Inputs but no responsibility are the characteristics.
- The costs of configuring, testing, and optimizing an old school system are usually higher than the vendor suggests. When the actual costs collide with the budget costs, the customer gets frisky. Fast Search & Transfer’s infamous revenue challenges came about in part because customers refused to pay when the system was not running and working as the marketers suggested it would.
- Employees cannot locate needed information and don’t like the interfaces. The information is often “in” the system but not in the indexes. And if in the indexes, the users cannot figure out which combination of keywords unlocks what’s needed. The response is, “Who has time for this?” When a satisfaction measure is required, somewhere between 55 and 75 percent of the search system’s users don’t like it very much.
Obviously organizations are looking for alternatives. Some use open source solutions, which are good enough. Other organizations put up with Windows’ search tools, which are also good enough. More important software systems like an enterprise resource planning or accounting system come with basic search functions. Again: These are good enough.
The focus of information access has shifted from indexing a limited corpus of content using a traditional solution to a more comprehensive, automated approach. No software is without its weaknesses. But compared to keyword search, there are vendors pointing customers toward a different approach.
Who are these vendors? In this short write up, I want to highlight the type of information about next generation information access vendors in my new monograph, CyberOSINT: Next Generation Information Access.
I want to highlight one vendor profiled in the monograph and mention three other vendors in the NGIA space which are not included in the first edition of the report but for whom I have reports available for a fee.
I want to direct your attention to Knowlesys, an NGIA vendor operating in Hong Kong and the Nanshan District, Shenzhen. On the surface, the company processes Web content. The firm also provides a free download of a scraping software, which is beginning to show its age.
Dig a bit deeper, and Knowlesys provides a range of custom services. These include deploying, maintaining, and operating next generation information access systems for clients. The company’s system can process and make available automatically content from internal, external, and third party providers. Access is available via standard desktop computers and mobile devices:
Source: Knowlesys, 2014.
The system handles both structured and unstructured content in English and a number of other languages.
The company does not reveal its clients and the firm routinely ignores communications sent via the online “contact us” mail form and faxed letters.
How sophisticated is the Knowlesys system? Compared to the other 20 systems analyzed for the CyberOSINT monograph, my assessment is that the company’s technology is on a par with that of other vendors offering NGIA systems. The plus of the Knowlesys system, if one can obtain a license, is that it will handle Chinese and other ideographic languages as well as the Romance languages. The downside is that for some applications, the company’s location in China may be a consideration.