January 20, 2017
Reading Search Engine Journal’s “The Evolution Of Semantic Search And Why Content Is Still King” brings to mind how RankBrain is changing the way Google ranks search relevancy. The article was written in 2014, but it stresses the importance of semantic search and SEO. With RankBrain, semantic search is more of a daily occurrence than something to strive for.
RankBrain also demonstrates how far search technology has come in three years. When people search, they no longer want to fish out the keywords from their query; instead they enter an entire question and expect the search engine to understand.
This brings up the question: is content still king? Back in 2014, the answer was yes and the answer is a giant YES now. With RankBrain learning the context behind queries, well-written content is what will drive search engine ranking:
What it boils to is search engines and their complex algorithms are trying to recognize quality over fluff. Sure, search engine optimization will make you more visible, but content is what will keep people coming back for more. You can safely say content will become a company asset because a company’s primary goal is to give value to their audience.
The article ends with something about natural language and how people want their content to reflect it. The article does not provide anything new, but does restate the value of content over fluff. What will happen when computers learn how to create semantic content, however?
Whitney Grace, January 20, 2017
November 17, 2015
I read “Half of World’s Museum Specimens Are Wrongly Labeled, Oxford University Finds.” Anyone involved in indexing knows the perils of assigning labels, tags, or what the whiz kids call metadata to an object.
Humans make mistakes. According to the write up:
As many as half of all natural history specimens held in the some of the world’s greatest institutions are probably wrongly labeled, according to experts at Oxford University and the Royal Botanic Garden in Edinburgh. The confusion has arisen because even accomplished naturalists struggle to tell the difference between similar plants and insects. And with hundreds or thousands of specimens arriving at once, it can be too time-consuming to meticulously research each and guesses have to be made.
Yikes. Only half. I know that human indexers get tired. Now there is just too much work to do. The reaction is typical of busy subject matter experts. Just guess. Close enough for horse shoes.
What about machine indexing? Anyone who has retrained an HP Autonomy system knows that humans get involved as well. If humans make mistakes with bugs and weeds, imagine what happens when a human has to figure out a blog post in a dialect of Korean.
The brutal reality is that indexing is a problem. When dealing with humans, the problems do not go away. When humans interact with automated systems, the automated systems make mistakes, often more rapidly than the sorry human indexing professionals do.
What’s the point?
I would sum up the implication as:
Do not believe a human (indexing species or marketer of automated indexing species).
Acceptable indexing with accuracy above 85 percent is very difficult to achieve. Unfortunately the graduates of a taxonomy boot camp or the entrepreneur flogging an automatic indexing system which is powered by artificial intelligence may not be reliable sources of information.
I know that this notion of high error rates is disappointing to those who believe their whizzy new system works like a champ.
Reality is often painful, particularly when indexing is involved.
What are the consequences? Here are three:
- Results of queries are incomplete or just wrong
- Users are unaware of missing information
- Failure to maintain either human, human assisted, or automated systems results in indexing drift. Eventually the indexing is just misleading if not incorrect.
How accurate is your firm’s indexing? How accurate is your own indexing?
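One rough way to answer that question is to spot check a sample of indexed items against labels verified by a careful reviewer. The sketch below is only an illustration: the function name, the specimen identifiers, and the sample labels are hypothetical, and the 85 percent bar is the figure mentioned above, not an industry standard.

```python
def indexing_accuracy(assigned, gold):
    """Return the fraction of reviewed items whose assigned label matches the verified label."""
    if not gold:
        raise ValueError("the reviewed sample is empty")
    correct = sum(1 for item, label in gold.items() if assigned.get(item) == label)
    return correct / len(gold)


# Hypothetical reviewed sample: item id -> label confirmed by a careful reviewer.
gold = {
    "spec-001": "Quercus robur",
    "spec-002": "Quercus petraea",
    "spec-003": "Fagus sylvatica",
    "spec-004": "Betula pendula",
}

# Labels as they currently sit in the (hypothetical) index.
assigned = {
    "spec-001": "Quercus robur",
    "spec-002": "Quercus robur",  # a plausible mix-up between similar species
    "spec-003": "Fagus sylvatica",
    "spec-004": "Betula pendula",
}

accuracy = indexing_accuracy(assigned, gold)
meets_threshold = accuracy > 0.85
print(f"accuracy: {accuracy:.0%}, above 85 percent: {meets_threshold}")
```

A sample this small proves nothing by itself. Repeating the check on a fresh sample at regular intervals is one way to notice indexing drift before users do.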
Stephen E Arnold, November 17, 2015
August 11, 2015
Editor’s note: The full text of the exclusive interview with Dr. Daniel J. Rogers, co-founder of Terbium Labs, is available on the Xenky Cyberwizards Speak Web service at www.xenky.com/terbium-labs. The interview was conducted on August 4, 2015.
Significant innovations in information access, despite the hyperbole of marketing and sales professionals, are relatively infrequent. In an exclusive interview, Danny Rogers, one of the founders of Terbium Labs, explained how the company has developed a way to flip on the lights and make it easy to locate information hidden in the Dark Web.
Web search has been a one-trick pony since the days of Excite, HotBot, and Lycos. For most people, a mobile device takes cues from the user’s location and click streams and displays answers. Access to digital information requires more than parlor tricks and pay-to-play advertising. A handful of companies are moving beyond commoditized search, and they are opening important new markets such as detecting the theft of secret and high value data. Terbium Labs can “illuminate the Dark Web.”
In an exclusive interview, Dr. Danny Rogers, who founded Terbium Labs with Michael Moore, explained the company’s ability to change how data breaches are located. He said:
Typically, breaches are discovered by third parties such as journalists or law enforcement. In fact, according to Verizon’s 2014 Data Breach Investigations Report, that was the case in 85% of data breaches. Furthermore, discovery, because it is by accident, often takes months, or may not happen at all when limited personnel resources are already heavily taxed. Estimates put the average breach discovery time between 200 and 230 days, an exceedingly long time for an organization’s data to be out of their control. We hope to change that. By using Matchlight, we bring the breach discovery time down to between 30 seconds and 15 minutes from the time stolen data is posted to the web, alerting our clients immediately and automatically. By dramatically reducing the breach discovery time and bringing that discovery into the organization, we’re able to reduce damages and open up more effective remediation options.
Terbium’s approach, it turns out, can be applied to traditional research into content domains to which most systems are effectively blind. At this time, a very small number of companies are able to index content that is not available to traditional content processing systems. Terbium acquires content from Web sites which require specialized software to access. Terbium’s system then processes the content, converting it into the equivalent of an old-fashioned fingerprint. Real-time pattern matching makes it possible for the company’s system to locate a client’s content, either in textual form, software binaries, or other digital representations.
One of the most significant information access innovations uses systems and methods developed by physicists to deal with the flood of data resulting from research into the behaviors of difficult-to-differentiate sub atomic particles.
One part of the process is for Terbium to acquire (crawl) content and convert it into encrypted 14 byte strings of zeros and ones. A client such as a bank then uses the Terbium content encryption and conversion process to produce representations of the confidential data, computer code, or other data. Terbium’s system, in effect, looks for matching digital fingerprints. The task of locating confidential or proprietary data via traditional means is expensive and often a hit and miss affair.
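The matching idea can be illustrated with a minimal sketch. This is not Terbium’s patented method, whose details are not given here. The sketch assumes overlapping four-word windows, plain SHA-256 hashing, and truncation to the 14 bytes mentioned above; the function name and the sample strings are hypothetical. Plain truncated hashes also do not offer the privacy guarantees the company describes.

```python
import hashlib

FINGERPRINT_BYTES = 14  # length mentioned in the text; everything else is an assumption
SHINGLE_WORDS = 4       # hypothetical window size


def fingerprints(text, k=SHINGLE_WORDS):
    """Hash each overlapping k-word window and keep the first 14 bytes of each digest."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        shingles = [" ".join(words)]
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {
        hashlib.sha256(s.encode("utf-8")).digest()[:FINGERPRINT_BYTES]
        for s in shingles
    }


# The client fingerprints its confidential record locally and shares only the hashes.
client_prints = fingerprints("account 4417 1234 5678 9113 belongs to jane example")

# The crawler fingerprints whatever it collects, here a made-up leaked posting.
crawled_page = "for sale account 4417 1234 5678 9113 belongs to jane example fresh dump"
page_prints = fingerprints(crawled_page)

# A match is an overlap between the two sets of fingerprints.
matches = client_prints & page_prints
match_ratio = len(matches) / len(client_prints)
print(f"{len(matches)} of {len(client_prints)} client fingerprints found on the page")
```

Because only fingerprints are compared, the party doing the matching never needs the client’s original text. A production system would also need normalization of punctuation and formatting, plus a keyed or otherwise hardened hashing scheme.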
Terbium Labs changes the rules of the game and in the process has created a way to provide its licensees with anti-fraud and anti-theft measures which are unique. In addition, Terbium’s digital fingerprints make it possible to find, analyze, and make sense of digital information not previously available. The system has applications for the Clear Web, which millions of people access every minute, to the hidden content residing on the so called Dark Web.
Terbium Labs, a start up located in Baltimore, Maryland, has developed technology that makes use of advanced mathematics—what I call numerical recipes—to perform analyses for the purpose of finding connections. The firm’s approach is one that deals with strings of zeros and ones, not the actual words and numbers in a stream of information. By matching these numerical tokens with content such as a data file of classified documents or a record of bank account numbers, Terbium does what strikes many, including myself, as a remarkable achievement.
Terbium’s technology can identify highly probable instances of improper use of classified or confidential information. Terbium can pinpoint where the compromised data reside on either the Clear Web, another network, or on the Dark Web. Terbium then alerts the organization about the compromised data and works with the victim of Internet fraud to resolve the matter in a satisfactory manner.
Terbium’s breakthrough has attracted considerable attention in the cyber security sector, and applications of the firm’s approach are beginning to surface for disciplines from competitive intelligence to health care.
We spent a significant amount of time working on both the private data fingerprinting protocol and the infrastructure required to privately index the dark web. We pull in billions of hashes daily, and the systems and technology required to do that in a stable and efficient way are extremely difficult to build. Right now we have over a quarter trillion data fingerprints in our index, and that number is growing by the billions every day.
The idea for the company emerged from a conversation with a colleague who wanted to find out immediately if a high profile client list was ever leaked to the Internet. But, said Rogers, “This individual could not reveal to Terbium the list itself.”
How can an organization locate secret information if that information cannot be provided to a system able to search for the confidential information?
The solution Terbium’s founders developed relies on novel use of encryption techniques, tokenization, Clear and Dark Web content acquisition and processing, and real time pattern matching methods. The interlocking innovations have been patented (US8,997,256), and Terbium is one of the few, perhaps the only company in the world, able to crack open Dark Web content within regulatory and national security constraints.
I think I have to say that the adversaries are winning right now. Despite billions being spent on information security, breaches are happening every single day. Currently, the best the industry can do is be reactive. The adversaries have the perpetual advantage of surprise and are constantly coming up with new ways to gain access to sensitive data. Additionally, the legal system has a long way to go to catch up with technology. It really is a free-for-all out there, which limits the ability of governments to respond. So right now, the attackers seem to be winning, though we see Terbium and Matchlight as part of the response that turns that tide.
Terbium’s product is Matchlight. According to Rogers:
Matchlight is the world’s first truly private, truly automated data intelligence system. It uses our data fingerprinting technology to build and maintain a private index of the dark web and other sites where stolen information is most often leaked or traded. While the space on the internet that traffics in that sort of activity isn’t intractably large, it’s certainly larger than any human analyst can keep up with. We use large-scale automation and big data technologies to provide early indicators of breach in order to make those analysts’ jobs more efficient. We also employ a unique data fingerprinting technology that allows us to monitor our clients’ information without ever having to see or store their originating data, meaning we don’t increase their attack surface and they don’t have to trust us with their information.
Stephen E Arnold, August 11, 2015
July 31, 2015
I am now getting interested in the marketing efforts of IBM Watson’s professionals. I have written about some of the items which my Overflight system snags.
I have gathered a handful of gems from the past week or so. As you peruse these items, remember several facts:
- Watson is Lucene, home brew scripts, and acquired search utilities like Vivisimo’s clustering and de-duplicating technology
- IBM said that Watson would be a multi billion dollar business and then dropped that target from 10 or 12 Autonomy scale operations to something more modest. How modest the company won’t say.
- IBM has tallied a baker’s dozen of quarterly reports with declining revenues
- IBM’s reallocation of employee resources continues as IBM is starting to run out of easy ways to trim expenses
- The good old mainframe is still a technology wonder, and it produces something Watson only dreams about: Profits.
Here we go. Remember high school English class and the “willing suspension of disbelief.” Keep that in mind, please.
ITEM 1: “IBM Watson to Help Cities Run Smarter.” The main assertion, which comes from unicorn land, is: “Purple Forge’s “Powered by IBM Watson” solution uses Watson’s question answering and natural language processing capabilities to let users ask questions and get evidence-based answers using a website, smartphone or wearable devices such as the Apple Watch, without having to wait for a call agent or a reply to an email.” There you go. Better customer service. Aren’t governments supposed to serve their citizens? Does the project suggest that city governments are not performing this basic duty? Smarter? Hmm.
ITEM 2: “Why I’m So Excited about Watson, IBM’s Answer Man.” In this remarkable essay, an “expert” explains that the president of IBM explained to a TV interviewer that IBM was being “reinvented.” Here’s the quote that I found amusing: “IBM invented almost everything about data,” Rometty insisted. “Our research lab was the first one ever in Silicon Valley. Creating Watson made perfect sense for us. Now he’s ready to help everyone.” Now the author is probably unaware that I was, lo, these many years ago, involved with IBM’s Herb Noble, who was struggling to make IBM’s own and much loved STAIRS III work. I wish to point out that Silicon Valley research did not have its hands on the steering wheel when it came to the STAIRS system. In fact, the job of making this puppy work fell to IBM folks in Germany as I recall.
ITEM 3: “IBM Watson, CVS Deal: How the Smartest Computer on Earth Could Shake Up Health Care for 70m Pharmacy Customers.” Now this is an astounding chunk of public relations output. I am confident that the author is confident that “real journalism” was involved. You know: Interviewing, researching, analyzing, using Watson, talking to customers, etc. Here’s the passage I highlighted: “One of the most frustrating things for patients can be a lack of access to their health or prescription history and the ability to share it. This is one of the things both IBM and CVS officials have said they hope to solve.” Yes, hope. It springs eternal as my mother used to say.
If you find these fact filled romps through the market activating technology of Watson convincing, you may be qualified to become a Watson believer. For me, I am reminded of Charles Bukowski’s alleged quip:
The problem with the world is that the intelligent people are full of doubts while the stupid ones are full of confidence.
Stephen E Arnold, July 31, 2015
June 12, 2015
Like the TSA’s perfect bag, Google’s search is the apex of findability, according to “Google Now Has Just Gotten Insanely Better and Very Freaky.” What causes such pinnacles of praise? According to the write up:
Google announced at an event in Paris a Location Aware Search feature that can answer a new set of questions, without the user having to ask questions that should include addresses or proper place names. Asking Google Now questions like “what is this museum?” or “when was this building built?” in proximity of the Louvre in Paris will get you answers about the Louvre, as Google will be able to use your location and understand what you meant by “this” or “this building”.
How does the feature work when one is looking for information about the location of a Dark Web hidden services server in Ashburn, Virginia? Ah, not so helpful perhaps? What’s the value of a targeted message in this insanely better environment? Good question.
Stephen E Arnold, June 12, 2015
May 12, 2015
Xenky.com has posted a single page which provides one click access to the three CyberOSINT videos. The videos provide highlights of Stephen E Arnold’s new monograph about next generation information access. You can explore the videos, which run a total of 30 minutes, on the Xenky site. One viewer said, “This has really opened my eyes. Thank you.”
Kenny Toth, May 12, 2015
May 8, 2015
I must admit that I knew very little about the collaborative economy. I used AirBnB one time and worried about my little test. I survived. I rode in an Uber car one time because my son is an aficionado. I am okay with the subway and walking. I ignore apps which allegedly make my life better, faster, and more expensive.
I saw a post which pointed me to the Chief Digital Officer Summit and that pointed me to this page with the amazing honeycomb shown below. The title is “Collaborative Economy Honeycomb 2: Watch It Grow.”
The hexagons are okay, but the bulk of the write up is a listing of companies which manifest the characteristics of a collaborative honeycomb outfit.
Most of the companies were unfamiliar to me. I did recognize the names of a couple of the honeycombers; for example, Khan Academy, Etsy, eBay (ah, delightful eBay), Craigslist, Freelancer, the Crypto currencies (yep, my Dark Web work illuminated this hexagon in the honeycomb for me), and Indiegogo (I met the founder at a function in Manhattan).
But the other 150 companies in the list were news to me.
But what caused me to perk up and pay attention was one factoid:
There were zero search, content processing, or next generation information access companies in the list.
I formed a hypothesis which will probably give indigestion to the individuals and financial services firms pumping money into search and content processing companies. Here it is:
The wave of innovation captured in the wonky honeycomb is moving forward with search as an item on a checklist. The finding functions of these outfits boil down to social media buzz and niche marketing. Information access is application centric, not search centric.
If I am correct, why would honeycomb companies in collaboration mode want to pump money into a proprietary keyword search system? Why not use open source software and put effort into features for the app crowd?
Net net: Generating big money from organic license deals may be very difficult if the honeycomb analysis is on the beam. How hard will it be to sell a high priced search system to the companies identified in this analysis? I think that the task might be difficult and time consuming.
The good news is that the list of companies provides outfits like Attivio, BA Insight, Coveo, Recommind, Smartlogic, and other information retrieval firms with some ducks at which to shoot. How many ducks will fall in a fusillade of marketing?
One hopes that the search sharpshooters prevail.
Stephen E Arnold, May 8, 2015
January 10, 2015
I love the phrase “beyond search.” Microsoft uses it, working overtime to become the go-to resource for next generation search. I learned that Oracle also finds the phrase ideal for describing the lash up of traditional database technology, the decades old Endeca technology, and the Dutch matching system from WCC Group.
You can read about this beyond search tie up in “Beyond Search in Policing: How Oracle Redefines Real time Policing and Investigation—Complementary Capabilities of Oracle’s Endeca Information Discovery and WCC’s ELISE.”
The white paper explains in 15 pages how intelligence led policing works. I am okay with the assertions, but I wonder if Endeca’s computationally intensive approach is suitable for certain content processing tasks. The meshing of matching with Endeca’s outputs results in an “integrated policing platform.”
The Oracle marketing piece explains ELISE in terms of “Intelligent Fusion.” Fusion is quite important in next generation information access. The diagram explaining ELISE is interesting:
Endeca’s indexing makes use of the MDex storage engine, which works quite well for certain types of applications; for example, bounded content and point-and-click access. Oracle shows this in terms of Endeca’s geographic output as a mash up:
For me, the most interesting part of the marketing piece was this diagram. It shows how two “search” systems integrate to meet the needs of modern police work:
It seems that WCC’s technology, also used for matching candidates with jobs, looks for matches and then Endeca adds an interface component once the Endeca system has worked through its computational processes.
For Oracle, ELISE and Endeca provide two legs of Oracle’s integrated police case management system.
Next generation information access systems move “beyond search” by integrating automated collection, analytics, and reporting functions. In my new monograph for law enforcement and intelligence professionals, I profile 21 vendors who provide NGIA. Oracle may go “beyond search,” but the company has not yet penetrated NGIA, next generation information access. More streamlined methods are required to cope with the type of data flows available to law enforcement and intelligence professionals.
For more information about NGIA, navigate to www.xenky.com/cyberosint.
Stephen E Arnold, January 10, 2015
September 10, 2014
I read “Artificial Intelligence Is Resurrecting Enterprise Search.” The unstated foundation of this write up is that enterprise search is dead. I am not sure I buy into that assumption. Last time I checked ElasticSearch was thriving with its open source approach. In fact, one “expert” pointed out that the decline in the fortunes of certain Brand Name search systems coincided with the rise in ElasticSearch’s fortunes. Connection? I don’t know, but enterprise search is thriving.
What needs resurrection (either the Phoenix variety or the Henry James’s varieties of mystical experience type) is search vendors whose software does not deliver for licensees. In this category are outfits that have just gone out of business; for example, Convera, Delphes, Entopia, Kartoo, Perfect Search, Siderean Software, and others.
Then there are the vendors with aging technology that have sold out to outfits that pack information retrieval into umbrella applications in order to put up hurdles for competitors to scale. If lock in won’t work, then find a way to build a barricade. Outfits with this approach include Dassault, OpenText, Oracle, and TeraText (now Leidos), among others.
Also, there are search vendors up to their ears in hock to venture funding firms. With stakeholders wanting some glimmer of a payout, the pressure is mounting. Companies in this leaky canoe include Attivio, BA Insight, Coveo, and Lucid Imagination, among others.
Another group of vendors are what I call long shots. These range from the quirky French search vendors like Antidot to Sinequa. There are some academic spin outs like Funnelback, which is now a commercial operation with its own unique challenges. And there are some other cats and dogs that live from deal to deal.
Finally, there are the giant companies looking for a way to make as much money as possible from the general ennui associated with proprietary search solutions. IBM is pitching Watson and using open source to get the basic findability function up and running. Microsoft is snagging technology from Jabber and bundling in various bits and pieces to deliver on the SharePoint vision of access to information in an organization. This Delve stuff is sort of smart, but until the product ships and provides access to a range of content types, I think Microsoft has a work in progress, not an enterprise solution upon which one can rely. The giant IHS is leveraging acquired technology into a search business, at least in the planners’ spreadsheets. Google offers its Search Appliance, which is one of the most expensive appliance solutions I have encountered. There is one witless mid tier consulting firm that believes a GSA is economical. Okay. And there is the name surfing Schubmehl from IDC who uses other people’s work to build a reputation.
To sum up, ElasticSearch is doing fine. Lots of other vendors are surviving or selling science fiction.
“Artificial Intelligence Is Resurrecting Enterprise Search” is a write up from one of the outfits eager to generate big dollars to keep the venture capitalists happy. Hey, don’t take the money if the recipients can’t generate big bucks.
Anyway, the premise of the write up is that enterprise search is dead and Microsoft’s Delve will give the software sector new life. The only folks who will get new life are the Microsoft savvy developers who can figure out how to set up, customize, optimize, and keep operational a grab bag of software.
Microsoft wants to provide a corporate SharePoint user with a single interface to the content needed to perform work. This is a pretty tough problem. SageMaker, now long gone, failed at this effort. Google asserted that its Search Appliance could pull off this trick. Google failed. Dozens of vendors talk about federated search and generally deliver results that are of the “close but no cigar” variety.
Now what’s artificial intelligence got to do with Delve? Well, the system uses personalization and cues to figure out what a business SharePoint user wants and needs. We know how well this works with the predictive services available from Apple, Google, and Microsoft Phone. Each time I use these services, I remember that they don’t work too well. Yep, Google really knows what I want about one out of 1,000 queries. For the other 999, Google generates laughable outputs.
Microsoft will be in the same rubber raft.
The write up does disagree with my viewpoint. Well, that’s okay because the BA Insight professional who tackles artificial intelligence is going to need more than inputs from Dave Schubmehl who recycles my information without my permission. If this write up is any indication, something has gone wrong somewhere along the line with regard to artificial intelligence, which is, I believe, an oxymoron.
Delve is, according to the write up, now “turning search on its head.” What? I need to find information about a specific topic. How will a SharePoint centric solution know I need that information? Well, that is not a viable scenario. Delve only knows what I have previously done. That’s the beauty of smart personalization. The problem is that my queries bounce from Ebola to silencers for tactical shotguns, from meth lab dispersion in Kentucky to the Muslim Brotherhood connections to certain political figures. Yep, Delve is going to be a really big help, right?
The write up asserts:
Companies need to get smarter about how they structure their information by addressing core foundational data layers. Pay attention to corporate taxonomies and introduce automated processes that add additional metadata where it’s left out from unstructured data sets. Doing this homework will make enterprise search results more relevant and will allow better results when interacting with enterprise data — whether it’s through text, voice or based on social distance. Access to enterprise data through intelligent interfaces is only getting better.
My reaction? My goodness. What the heck does this collection of buzzwords have to do with advanced software methods for information retrieval? Not much. That’s what the write up is conveying to me.
Hopefully the investors in BA Insight find more to embrace than I do. If I were an investor, I would demand that my money be spent for more impactful essays, not reminders that Microsoft, like IBM, thrives on services, certification, and customers who may not know how to determine if software is smart.
Stephen E Arnold, September 10, 2014
August 6, 2014
Connotate posted a page that lists 51 features. The title of the Web page is “What Connotate Does Better than Scripts, Scrapers, and Toolkits.” The 51 features are grouped into 10 categories. Several are standard content processing operations; for example, scaling, ease of use, and rapid deployment.
Several are somewhat fuzzy. A good example is the category “Efficiency”. Connotate explains this concept with these features:
- Highly efficient code is automatically generated during Agent training
- Agents bookmark the final destination and identify links that aren’t necessary, bypassing useless links and arriving at the desired data much faster
- Optimized navigation also generates less traffic on target websites
- Supports load balancing
- Multi-threaded – supports simultaneous execution of multiple Agents on a single system
- Optimizes resource usage by analyzing clues during runtime about the various intended uses of the extracted data
From my experience with training systems, I know that the process can be quite a job, particularly when the source content is not scientific, technical, and medical information. STM is somewhat easier because the terminology is less colorful than social media content, for example. The deployment of agents that do not trigger a block by a target is a good idea. But load balancing is a different type of efficiency and one that is becoming part of some vendors’ punch lists.
I found the 51 items useful as a thought starter for crafting a listicle.
Stephen E Arnold, August 6, 2014