Import.io and Connotate: One Year Later
March 3, 2020
There has been an interesting shift in search and content processing. Import.io, founded in 2012, purchased Connotate. Before you ask, “Connotate what?”, let me say that Connotate was a content scraping and analysis firm. I paid some attention to Connotate when it acquired Fetch, an outfit with an honest-to-goodness Xoogler on its team. Fetch processed structured data and Connotate was mostly an unstructured data outfit. I asked a Connotate professional when the company would process Dark Web content, only to be told, “We can’t comment on that.” Secretive, right?
Connotate was founded in 2000 and required about $25 million in funding. The amount Import.io paid was not revealed in a source to which DarkCyber has access. Import.io has ingested about $38 million. DarkCyber assumes that the stakeholders are confident that 1 + 1 will equal 3 or more.
Import.io says:
We are funded by some of the greatest minds in technology.
The great minds include AME Cloud Ventures, Open Ocean, IP Group, and several others.
The company explains:
Starting from a simple web data extractor and evolving to an enterprise level solution for concurrently getting data that drives business, industry, and goodness.
What does the company provide? The answer is Web data integration: Identify, extract, prepare, integrate, and consume content from a user-provided list of URLs. To illustrate the depth of the company’s capabilities, Import.io defines “integrate” this way:
Integrate prepared data with a library of APIs to support seamless integration with internal business systems and workflows or deliver it to any data repository to develop robust data sets for advanced analytics capabilities.
The firm’s Web site makes it clear that it serves the online travel, retail, manufacturing, hedge fund, advisory services, data scientists, analysts, journalists, marketing and product, hospitality, and media producers. These are a mix of sectors and industries, and DarkCyber did not create the grammatically inconsistent listing.
Import.io offers videos which provide some information about one of its important innovations, “interactive extractors.” The idea is to convert script editing to point-and-click choices.
The company is growing. About a year ago, Import.io said that it experienced record sales growth. The company provided a link to its Help Center, but a number of panels contained neither information nor links to content.
The company offers a free version and a premium version. Price quotes are provided by the company.
Like Amplyfi and maybe ServiceMaster, Import.io is a company providing search and content processing with a 21st century business positioning. A new buzzword is needed to convey what Import.io, Amplyfi, and ServiceMaster are providing. DarkCyber believes that these companies are examples of where search and content processing has begun to coalesce.
The question is, “Is acquiring, indexing, and analyzing OSINT content a truck stop or a destination like Miami Beach?”
Worth monitoring the trajectory of the company.
Stephen E Arnold, March 3, 2020
Microsoft Azure: Search, Artificial Intelligence, and Some Mystical Magic
March 3, 2020
DarkCyber spotted “Microsoft Announcements on Azure Artificial Intelligence.” The article is a summary of assorted Microsoft Azure assertions. Note that the article did not offer any information about the semi-failure of Cortana and Windows 10 search to thrill their users. But Azure is different. Microsoft does Azure better than Windows 10 updates… sometimes.
There were several highlights in the article.
First, Azure has artificial intelligence. The approach is open, interoperable, workflow, and “easy adaptation.” Is this why certified Microsoft Azure professionals are buying new houses and fancier automobiles?
Second, Azure does machine learning. The idea is that there are agents, applications, a machine learning model engine, support for R, and an enterprise edition. DarkCyber does not know a single person running Azure to make life better, faster, and cheaper except Azure consultants. But the big assertion is that Azure’s ML “delivers a unified data science experience.” DarkCyber wonders, “Does this include Outlook attachments?”
Third, Azure has updated some of its “old” features. There’s nothing like constant improvement, as the flow of Windows 10 updates, uninstalls, and reinstalls shows. Now Azure does better decision making. Sentiment analysis has more deep learning and natural language processing. The system can do image analysis, and it has some of that Cortana goodness which has been repositioned in Windows 10 because it was so darned wonderful.
Fourth, Azure does knowledge mining. Azure does cognitive search. Azure recognizes forms.
The showcase client is a publishing company. The Atlantic has gone all in on the Azure systems. Another happy camper is AutoTrader.ca. Plus Archive 360 is tickled with the ability to use Azure cognitive search quickly and cost effectively. Yep, DarkCyber believes this was a smooth, easy implementation.
If you doubt that Microsoft is number one, read the article. If not, you will enjoy some of the ironies. How many search systems does Microsoft offer? How many of them are super? Who remembers Fast Search & Transfer?
Yep, super search the Azure way. It’s just like using Word’s numbering feature or figuring out PowerPoint backgrounds.
Stephen E Arnold, March 3, 2020
Trellis Research Gets Money And New Technical Co-Founder
February 27, 2020
If there is one industry that needs a powerful and accurate search and analytics tool it is court systems. Los Angeles startup Trellis Research, which specializes in software for state court data, recently made news with a big fundraising round and an addition to its team. TechCrunch explains the details in “Building A Search Tool For State Court Data And Analytics, Trellis Adds Alon Shwartz As Co-Founder.”
Trellis Research is a fire starter startup, known for designing analytics and search software for state legal systems.
Craft Ventures recently raised $4.4 million in funding for Trellis. The company also added a new technical co-founder, Alon Shwartz, whose best-known earlier ventures were Docstoc, an online store and electronic document repository for financial, legal, and professional documents, and unGlue, a startup that regulates screen time for families. Shwartz will serve as chief product officer, working side by side with company founder Nicole Clark.
Trellis’s home office is in California, where it covers California Superior Court records and judicial analytics. With the new round of funding, the company hopes to expand to Florida, Delaware, Texas, and New York. Clark founded Trellis when she discovered a need for better search and analytics software in the courts:
“ ‘I was customer one,’ says Clark of the product. A former litigator in Los Angeles, the entrepreneur developed Trellis to serve her own research needs. ‘I used this data for two years and during those years I won every motion that I had,’ says Clark. ‘It made it so obvious what a competitive advantage this was. It’s a way to analyze how a judge thinks about issues and a lawyer can draft their motions with a particular judge in mind.’”
Trellis offers a freemium service for searching state trial decisions and filings, but to access the actual documents people need to become paying users. There is a $100 fee for individuals, and enterprise pricing is negotiable. Once beyond the paywall, users can file, download, print, and analyze documents.
Clark promises that attorneys will double their win rate with Trellis Research software.
Whitney Grace, February 27, 2020
Elastic App to Stretch Finding
February 26, 2020
Elasticsearch is one of the most used open source search applications. While Elasticsearch is free for open source developers to download, the company offers subscriptions for customer support and enhanced software. Street Insider shares that Elastic has added to its service in “Elastic Announces The General Availability Of Elastic App Search On Elasticsearch Service.”
Starting now, Elasticsearch Service users can deploy App Search directly from their dashboard. A powerful search experience is available on mobile devices harnessing the Elastic Cloud. The new Elastic App Search also includes new geolocation options and pricing:
“This milestone also unlocks a whole new choice of geolocation options for Elastic App Search users: from São Paulo to Singapore and California to Germany, App Search can be hosted everywhere you find our Elasticsearch Service.
Elastic didn’t just make getting started on App Search easier — they’ve also simplified pricing by switching to the same resource-based pricing model that Elasticsearch Service uses. With App Search on Elasticsearch Service, users only pay for the resources they consume, without worrying about artificial constraints around the number of users, documents, or operations made. It’s a whole new approach to pricing search that’s transparent and fair.”
Elastic, the parent company, is dedicated to making its software available to anyone who needs powerful search. Elastic offers free trials and opportunities to build prototypes.
Whitney Grace, February 26, 2020
More PR for Cognitive Search
February 20, 2020
With available data growing faster than traditional search technology seems able to handle, ToolBox predicts, “‘Cognitive Search’ May Be the Sector to Watch.” Writer Santiago Perez considers:
“On an individual level, we have all grappled with the frustrating experience of trying to enter just the right keyword or combination of letters and numbers to get to the exact bit of data we need. But as data multiplies continuously in libraries and archives, a new sort of search with the ability to cut through the chaff is coming into its own. It’s called ‘cognitive search.’ As the term suggests, the ‘thinking’ is deeper than that in a traditional keyword search. It’s leveraged by artificial intelligence and machine learning and gathers insights from signals and behavioral data. The insights can come from activities such as employee visits to web pages, their interactions with each other via chat media or the documents they produce and store.”
Perez cites research (PDF) that indicates between 60 percent and 73 percent of information corporations have gathered is currently unused. However, we wonder whether the focus is in the right place here: what is the quality of such data? Where does it originate, how was it gathered, and has anyone verified it? For the vast majority, the answer is “of course not.”
Be that as it may, both Amazon and Microsoft are forging ahead with machine-learning based cognitive search solutions to more thoroughly analyze all that (suspect) data. AWS’s Kendra is currently only available in northern Virginia, Oregon, and Ireland, but they do have a preview available for AWS users. Microsoft is positioning its Project Cortex as the “fourth pillar” of Microsoft Office. See the write-up for more details on each of these products.
Cynthia Murrell, February 20, 2020
LucidWorks: Mom, Do My Three Cs Add Up to an A?
February 19, 2020
Search firm Lucidworks has put out a white paper explaining its new 3 Cs of enterprise search, we learn from the write-up “Understanding Intention: Using Content, Context, and the Crowd to Build Better Search Applications” from InsideBigData. Registration is required to download and read the paper, but the company has also put out a PDF called, more simply, “Understanding Intention” that gives us its perspective.
In the 3 Cs section of that document, they note that enterprise search pretty much has content wrapped up. With tools like Hadoop, Solr, and NoSQL, we can now access unstructured as well as structured data. Context means, in part, understanding how different pieces of content relate to each other. It also means analyzing which pieces of information will be relevant to each searcher—and this is the exciting part for Lucidworks. The document explains:
“When a search app knows more about you, it can create a relevant search experience that helps you get personal, actionable search results on a consistent basis. Search apps have solved that problem with signal processing. A signal is any bit of information that tells the app more about who you are. Signals can include your job title, business unit, location, device, and search history, as well as past actions within the search app like clickstream, purchasing behavior, direct reports, upcoming meetings or events, and more.”
Interesting. As for the crowd portion, it has to do with matching searchers with content found by similar entities that have searched before. We’re told:
“When a search app uses the crowd, it goes beyond documents and data, past your specific user profile and relationship, and examines how other users are interacting with the data and information. A search app knows the behavioral information of thousands — sometimes millions — of other users. By keeping track of every user, search apps can bubble up what you will find important and relevant and what other users like you will want, too. The tech uses its knowledge of your office, role, and demographic to match to the same in other users and make intelligent judgments about what will help you the most.”
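For readers who want to see what “using the crowd” might look like in practice, here is a minimal sketch in Python. It is not Lucidworks’ method; it is an illustration under stated assumptions. The function names (`profile_similarity`, `crowd_rerank`), the profile fields, and the click log format are all hypothetical. The idea: boost a document’s base relevance score by how often users with profiles like the searcher’s have clicked on it.

```python
from collections import Counter


def profile_similarity(a, b):
    """Share of common profile fields (e.g. office, role) whose values match.

    Hypothetical similarity measure; a real system would likely weight fields.
    """
    keys = set(a) & set(b)
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def crowd_rerank(results, user, click_log, weight=1.0):
    """Re-rank results using clicks from users who resemble the searcher.

    results   -- list of (doc_id, base_score) pairs from the search engine
    user      -- dict describing the searcher, e.g. {"office": ..., "role": ...}
    click_log -- list of (profile_dict, doc_id) pairs from past sessions
    weight    -- how strongly the crowd signal counts against the base score
    """
    boost = Counter()
    for other_profile, doc_id in click_log:
        # Clicks from similar users count for more than clicks from strangers.
        boost[doc_id] += profile_similarity(user, other_profile)
    rescored = [(doc_id, score + weight * boost[doc_id]) for doc_id, score in results]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Even in this toy version, the result depends entirely on the `weight` setting and on the quality of the logged profiles and clicks, which is the nub of the question that follows.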
But how good is the tech, really, at identifying what information one truly needs, and how would we know? Do three Cs add up to an A in search? Not yet, Willy.
Cynthia Murrell, February 19, 2020
A Fanciful Explanation of the Expensive Failure of IBM Watson
February 19, 2020
I love the idea of revisionist history. I associate the method with an individual named Ioseb Besarionis dze Jughashvili.
Alleged Stalin quote: It is not heroes that make history, but history that makes heroes.
You may know this allegedly competent leader as Joseph Stalin. Changing history is one way to make sure the present comes out in a way that is more satisfying — at least to some people.
I read “IBM Watson And The Value Of Open.” I thought of Jughashvili in the terms my former history professor (Dr. Philip Miller Crane) used to explain the revisionist thing.
My analysis of IBM Watson included information I obtained when I was researching my various and sundry books about search and retrieval. I did not include IBM as a “recommended” solution for three reasons:
- Watson was a marketing confection which conflated a range of technologies: Some developed by IBM and others obtained via an open source download or by paying money for technology; for example, Vivisimo, a metasearch and clustering system
- Training “Watson” required programmers to interview subject matter experts, create specific content domains, test, do more interviews, retrain, and test. Once the content domain was in hand, Watson would crunch away to locate an answer. Many companies do a similar expensive process. IBM was unique in making Watson seem something other than what other vendors offered. By sweeping the time and cost of training under the digital rug, Watson was cut loose from reality.
- Question answering systems work when certain conditions are met; for example, content, response expected, handcrafted rules that mostly work. Toss the system questions based on new content, and the responses are going to be interesting if not off base a certain percentage of the time.
To sum up, the cost and unreliability of Watson were wildly out of step with the marketing of cognitive computing. IBM’s billions made it possible for search and retrieval carpetbaggers to describe their systems as “cognitive”; that is, infused with artificial intelligence, predictive linguistics, and my favorite bit of jargon, natural language processing.
The article’s explanation of the failure of IBM’s billion dollar bet, the office near NYU, and the absolutely bonkers ad in the New York Times for Watson as a collection of digital molecules is at odds with my assessment.
That’s okay. Let’s look at a couple of the “revelations” in this Forbes article.
The Texas Fold
The write up, explaining the outright failure of Watson as a useful medical tool for cancer doctors, says:
But with the passage of more time, it must be said that IBM Watson has not delivered the results that IBM expected. One particular moment was the decision of MD Anderson’s Cancer Center to withdraw from its partnership with IBM in 2017. An internal audit by the University of Texas found that the university had spent over $62 million dollars (not counting internal staff time) and did not meet its goals. Other health partners soon followed.
Yep, to summarize. Watson did not work. In fact, I heard from a reliable source that cancer doctors in New York City refused to answer endless programmer questions. The message for me was, “Cancer doctors don’t want to teach programmers how to be cancer doctors.” Hasta la vista to Texas.
The Wrong Explanation: Vertical Integration
Why did IBM Watson succumb to its self-generated cancer? Here’s what the Forbes write up asserts:
Being vertically integrated gave IBM complete end-to-end control over Watson. But it condemned Watson to being applied in only a few areas. IBM essentially had to guess where this powerful technology could best be applied. Even within health care, some likely areas for Watson like radiology were not pursued in its early years. Because of the limited number of areas IBM was able to explore for using Watson, we will never know whether there were other areas where Watson might have performed beautifully.
Okay, this means in my opinion that IBM engineers and scientists wanted to run the show. There was, therefore, one throat to choke. That throat was IBM Watson’s. The fallout continues: a new CEO, hoots of laughter when I tell people about IBM’s Watson ads, and the loss of shareholder value. I would roll in the weird layoffs as a somewhat desperate way to slash costs too.
Alleged Stalin quote: Death is the solution to all problems. No man – no problem.
Forget vertical integration. The reason for failure was that the system and method did not work.
The Reality
Mr. Jughashvili would be proud of this analysis. It rewrites history. But like Mr. Jughashvili’s, Watson’s actions live on. Changing the words does not alter the reality.
Don’t believe me? Just ask IBM Watson. Is IBM “open”?
Stephen E Arnold, February 19, 2020
NoSQL DBMS: A Surprising Inclusion
February 12, 2020
“Top Databases Used in Machine Learning Project” is a listicle. The information in the write up is similar to the lists of “best” products whipped up by Silicon Valley type publications, mid tier consulting firms (a shade off the blue chip outfits like McKinsey, Booz, and BCG), and 20 somethings fresh from university.
The interesting inclusion in the list of DBMS is?
If you said Elasticsearch, you would be correct. Elasticsearch is an open source play doing business as Elastic. The open source version is at its core a search and retrieval system. (Does this mean the index is the data and the database?)
DarkCyber is not going to get into a discussion of whether an enterprise search system can be a database management system. Both sides in the battle are less interested in resolving the fuzzy language than making sales.
Maybe Elasticsearch is just doing what other enterprise search systems have done since the 1980s? Vendors describe search and retrieval as the solution to the world’s data management Wu Flu.
Net net: Without boundaries, why make distinctions? Just close the deal. Distinctions are irrelevant for some business tasks.
Stephen E Arnold, February 12, 2020
Founder of Autonomy: Extradition Action
February 5, 2020
DarkCyber noted this CBR Online story: “Mike Lynch Submits Himself for Arrest.” The write up states:
Former Autonomy CEO Dr Mike Lynch has submitted himself for arrest this morning, in what his legal team described as a formality required as part of an extradition process initiated by the US Department of Justice. Lynch is still contesting extradition.
The story about the founder of Autonomy and DarkTrace continues. A free profile about Autonomy is available at this link. (Note: this document is a rough draft prepared for a client before the Hewlett Packard purchase of the company. Also, Autonomy was a client of mine before I retired in 2013.)
Stephen E Arnold, February 5, 2020
Paris Museums: More Art Online. Search Means Old Fashioned Hunting Around
February 5, 2020
Oh, boy—it is a collection of art from the many Paris Museums available online at Paris Musées Collections. This artist’s daughter is delighted!
Unfortunately, the site’s search functionality disappoints. Unless your goal is either to find a specific work or to aimlessly browse the 150,213 public domain images, it is another almost unusable collection. I suppose trusting to serendipity has its place, but most of us are looking for something a bit more specific, even if we don’t have a particular title or artist in mind.
There is a section titled “Thematic Discovering,” which might be useful to some. They have put together 11 preconfigured themes that span museums, like “Sport, Jeux Olympiques et Paris” (Sports, Olympic Games, and Paris) or “Elements: Air, Terre, Feu, Eau” (Elements: Air, Earth, Fire, Water). They do make for interesting guided tours. There is also a highlighted Virtual Exhibition, along with a few suggested works at the bottom of the page.
I was excited to find this resource—it really is a valuable collection to have at our fingertips. If only it were easier to navigate. Check it out if you feel persistent.
And for those who think search is really great: none of the visual art collections features a search which delivers what most users seek.
Cynthia Murrell, February 5, 2020