Intelligent Tagging Makes Unstructured Data Usable

March 20, 2020

We are not going to talk about indexing accuracy. Just keep that idea in mind, please.

Unstructured data is a nightmare nobody wants to handle. Buried within a giant unstructured mess, however, is usable information. How do you get to it? Multiple digital solutions, software applications, and big data tools promise to get the job done, which raises another question: which tool do you choose? Among the choices is Intelligent Tagging from Refinitiv.

What is “intelligent tagging”?

“Intelligent Tagging uses natural language processing, text analytics and data-mining technologies to derive meaning from vast amounts of unstructured content. It’s the fastest, easiest and most accurate way to tag the people, places, facts and events in your data, and then assign financial topics and themes to increase your content’s value, accessibility and interoperability. Connecting your data consistently with Intelligent Tagging helps you to search smarter, personalize content recommendations and generate alpha.”

Intelligent Tagging can read through gigabytes of different textual information (emails, texts, notes, etc.) using natural language processing. The software structures the data by assigning tags, then forming connections among the content. Once the information is organized, search can quickly locate the desired information. Content can be organized in a variety of ways: companies, people, locations, topics, and more. Relevancy scores indicate how strongly each tag relates to the search results. Intelligent Tagging also updates itself in real time by monitoring the news and adding new metadata tags.

The result is an optimized search experience that yields more powerful results in less time than similar software.
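As a concrete illustration of the tag-and-score workflow just described, here is a minimal sketch using the open source spaCy library as a stand-in. Refinitiv’s models, taxonomy, and scoring are proprietary; the relevancy heuristic below is invented for illustration only.

```python
# Sketch: tag entities in a document and attach a crude relevancy score.
# spaCy stands in for Intelligent Tagging's proprietary NLP pipeline.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # small English model, installed separately

def tag_document(text):
    """Return (entity, type, relevancy) triples for one document."""
    doc = nlp(text)
    counts = Counter(ent.text for ent in doc.ents)
    total = sum(counts.values()) or 1
    tags = {}
    for ent in doc.ents:
        # Invented heuristic: frequent entities score higher, with a boost
        # for entities that appear in the first fifth of the document.
        score = counts[ent.text] / total
        if ent.start_char < len(text) * 0.2:
            score = min(score * 1.5, 1.0)
        tags[(ent.text, ent.label_)] = round(score, 3)
    return [(name, label, s) for (name, label), s in tags.items()]

print(tag_document("Refinitiv, based in London, sells data to banks in New York."))
```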

Intelligent Tagging offers a necessary service, but the only way to see whether it delivers on its promise to bring structure to data piles is to test it out.

Whitney Grace, March 20, 2020

IslandInText Reborn: TLDRThis

March 16, 2020

Many years ago (maybe 25+), we tested a desktop summarization tool called IslandInText. [#1 below] I believe, if my memory is working today, this was software developed in Australia by Island Software. There was a desktop version and a more robust system for large-scale summarizing of text. In the 1980s, there was quite a bit of interest in automatic summarization of text. Autonomy’s system could be configured to generate a précis if one was familiar with that system. Google’s basic citation is a modern version of what smart software can do to suggest what’s in a source item. No humans needed, of course. Too expensive and inefficient for the big folks I assume.

For many years, human abstract and indexing professionals were on staff. Our automated systems, despite their usefulness, could not handle nuances, special inclusions in source documents like graphs and tables, lists of entities (which we processed with the controlled term MANYCOMPANIES), and other specialized functions. I would point out that most of today’s “modern” abstracting and indexing services are simply not as good as the original services like ABI / INFORM, Chemical Abstracts, Engineering Index, Predicasts, and other pioneers in the commercial database sector. (Anyone remember Ev Brenner? That’s what I thought, gentle reader. One does not have to bother oneself with the past in today’s mobile phone search expert world.)

For a number of years, I worked in the commercial database business. Machine text summarization interested me and my colleagues because it promised to speed the throughput of our citations for pharmaceutical, business, and other topic domains.

A reader informed me that a new service is available. It is called TLDRThis. Here’s what the splash page looks like:

[Screenshot: the TLDRThis splash page]

One can paste text or provide a url, and the system returns a synopsis of the source document. (The advanced service generates a more in-depth summary, but I did not test this. I am not too keen on signing up without knowing what the terms and conditions are.) There is a browser extension for the service. For this url, the system returned this summary:

Enterprise Search: The Floundering Fish!

Stephen E. Arnold Monitors Search,Content Processing,Text Mining,Related Topics His High-Tech Nerve Center In Rural Kentucky.,He Tries To Winnow The Goose Feathers The Giblets. He Works With Colleagues,Worldwide To Make This Web Log Useful To Those Who Want To Go,Beyond Search . Contact Him At Sa,At,Arnoldit.Com. His Web Site,With Additional Information About Search Is  |    Oct 27, 2011  |  Time Saved: 5 mins

  1. I am thinking about another monograph on the topic of “enterprise search.” The subject seems to be a bit like the motion picture protagonist Jason.
  2. The landscape of enterprise search is pretty much unchanged.
  3. But the technology of yesterday’s giants of enterprise search is pretty much unchanged.
  4. The reality is that the original Big Five had and still have technology rooted in the mid to late 1990s.

We noted several positive functions; for example, identifying the author and providing a synopsis of the source, even the goose feathers’ reference. On the downside, the system missed the main point of the article; that is, enterprise search has been a bit of a chimera for decades. Also, the system ignored the entities (company names) in the write up. These are important in my experience. People search for names, concepts, and events. The best synopses capture some of the entities and tell the reader to get the full list and other information from the source document. I am not sure what to make of TLDRThis’ display of a picture which makes zero sense without the context of the full article. I fed the system a PDF, which did not compute, and I tried a bit.ly link, which generated a request to refresh the page, not the summary.

To get an “advanced summary”, one must sign up. I did not choose to do that. I have added this site to our “follow” list. I will make a note to try and find out who developed this service.

Pricing ranges from free for basic summarization to paid tiers. The Bronze level, $60 per year, includes 100 summaries per month and “exclusive features,” which are coming soon. A $10 per month tier includes 300 summaries a month and “exclusive features,” also coming soon. The Platinum service, $20 per month, includes 1,000 summaries per month; these are “better” and will include forthcoming advanced features.

Stay tuned.

[#1] In the early 1990s, search and retrieval was starting to move from the esoteric world of commercial databases to desktop and UNIX machines. IslandSoft, founded in 1993, offered a search and retrieval system. My files from this time revealed that IslandSoft’s description of its system could be reused by today’s search and retrieval marketers. Here’s what IslandSoft said about InText:

IslandInTEXT is a document retrieval and management application for PCs and Unix workstations. IslandInTEXT’s powerful document analysis engine lets users quickly access documents through plain English queries, summarize large documents based on content rather than key words, and automatically route incoming text and documents to user-defined SmartFolders. IslandInTEXT offers the strongest solution yet to help organize and utilize information with large numbers of legacy documents residing on PCs, workstations, and servers as well as the proliferation of electronic mail documents and other data. IslandInTEXT supports a number of popular word processing formats including IslandWrite, Microsoft Word, and WordPerfect plus ASCII text.

IslandInTEXT Includes:

  • File cabinet/file folder metaphor.
  • HTML conversion.
  • Natural language queries for easily locating documents.
  • Relevancy ranking of query results.
  • Document summaries based on statistical relevance from 1 to 99% of the original document—create executive summaries of large documents instantly. [This means that the user can specify how detailed the summarization was; for example, a paragraph or a page or two.]
  • Summary Options. Summaries can be based on key word selection, key word ordering, key sentences, and many more.

[For example:]

  • SmartFolder Routing. Directs incoming text and documents to user-defined folders.

  • Hot Link Pointers. Allow documents to be viewed in their native format without creating copies of the original documents.

  • Heuristic/Learning Architecture. Allows InTEXT to analyze documents according to the author’s style.
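The 1-to-99-percent dial maps naturally onto a sentence-scoring cutoff. Here is a minimal sketch of the statistical, extractive approach such tools take: score each sentence by how many of the document’s frequent terms it contains, keep the top fraction, and preserve the original order. This is a generic frequency heuristic, not Island Software’s actual algorithm.

```python
# Sketch: frequency-based extractive summarization with an adjustable ratio.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
             "it", "for", "on", "that", "this", "with", "as", "at"}

def summarize(text, ratio=0.25):
    """Keep roughly `ratio` of the sentences, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sent):
        # A sentence matters if it concentrates the document's frequent terms.
        return sum(freq[w] for w in re.findall(r"[a-z']+", sent.lower()))

    keep = max(1, round(len(sentences) * ratio))  # the 1%-99% dial
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(ranked))
```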

A page for InText is still online as of today at http://www.intext.com/. The company appears to have ceased operations in 2010. Data in my files indicate that the name and possibly the code are owned by CP Software, but I have not verified this. I did not include InText in the first edition of my Enterprise Search Report, which I wrote in 2003 and 2004. The company had fallen behind market leaders Autonomy, Endeca, and Fast Search & Transfer.

I am surprised at how many search and retrieval companies today are just traveling along well worn paths in the digital landscape. Does search work? Nope. That’s why there are people who specialize, remember things, and maintain personal files. Mobile device search means precision and recall are digital dodo birds in my opinion.

Stephen E Arnold, March 16, 2020


Eliminalia: Reputation Management and Content Removal

March 12, 2020

One of our readers called our attention to a company called Eliminalia. This firm provides what DarkCyber considers reputation management services. The firm’s unique selling proposition is speed: it says it can achieve results quickly. DarkCyber does not have a firm position on the value of reputation management firms. The organizations or individuals who want content removed may feel a compelling need to modify history or take content corrective actions. Because removing content rests in the hands of a third party, often a large indexing company, getting attention and action can be a challenging job. Europa Press asserts that 24 percent of people and businesses want to have data about them removed from “the Internet.” We took a quick look at our files and located some information. Here’s a summary of points we found interesting.

[Image: summary of points from DarkCyber’s files]

Plus, the firm asserts:

We are the first to guarantee the results or we will refund your money. We will give an answer to your doubts and needs. We will help you and advise you on a global level.

The firm adds:

We delete internet data and information and guarantee your right to be forgotten. Eliminalia is the leading company in the field which guarantees that the information that bothers and harms you is completely deleted from Internet search engines (Google, Bing, etc.), web portals, blogs…

The firm offers three videos on Vimeo. The most recent video is at https://vimeo.com/222670049 and includes this commentary:

Eliminalia is a renowned company with several world headquarters that protects online privacy and reputation of its customers, finding and removing negative contents from the Web.

There are several YouTube videos as well. These may be located at this link.

The company has offices in Brazil, Colombia, Ecuador, Italy, Mexico, Switzerland, and the United Kingdom.


Eliminalia offers a mobile app for iPhones and Android devices.

The firm’s Web site asserts:

  • 99% happy satisfied clients
  • 8260+ success stories
  • 3540 business clients.

The company states:

We delete your name from:

  • Mass media
  • State gazettes
  • Social media

The president of Eliminalia is Dídac Sánchez. The company was founded in 2013. Crunchbase lists the date of the company’s founding as 2011.


There is an interesting, but difficult to verify, article about the Eliminalia process in “Why Is William Hill a Corporate Partner of Alzheimer’s Society?” The assertions about Eliminalia appear toward the end of the WordPress post. These can be located by searching for the term “Eliminalia.” One interesting item in the write up is that the Eliminalia business allegedly shares an address with World Intelligence Ltd. It is also not clear if Eliminalia is headquartered in Manchester at 53 Fountain Street. Note: the William Hill article includes other names allegedly associated with the company.

DarkCyber believes the company focuses on selling its services in countries with data protection regulations. The firm has a strong Spanish flavor.

If you are interested in having content removed from the Internet, consider speaking with Eliminalia. DarkCyber believes that some content can be difficult to remove. Requests for removal can be submitted; some sites, like www.accessify.com, have a “removal request button.” However, there may be backlogs, bureaucracy, and indifference to requests interpreted as trivial or nuisance. Our files contain quite interesting details about the firm, and DarkCyber can prepare a more robust summary of the company, including information about the methods used to remove content from the Internet.

Stephen E Arnold, March 12, 2020

WhatsApp: Indexed by Google

March 11, 2020

The Orissa Post reports, “Google Indexes Private WhatsApp Group Chat Links.” As a result of the search indexing, assorted private chat groups were summarily forced open for anyone to join. Writer IANS reports,

“According to a report in Motherboard, invitations to WhatsApp group chats were being indexed by Google. The team found private groups using specific Google searches and even joined a group intended for NGOs accredited by the UN and had access to all the participants and their phone numbers. Journalist Jordan Wildon said on Twitter that he discovered that WhatsApp’s ‘Invite to Group Link’ feature lets Google index groups, making them available across the internet since the links are being shared outside of WhatsApp’s secure private messaging service. ‘Your WhatsApp groups may not be as secure as you think they are,’ Wildon tweeted Friday, adding that using particular Google searches, people can discover links to the chats. According to app reverse-engineer Jane Wong, Google has around 470,000 results for a simple search of ‘chat.whatsapp.com’, part of the URL that makes up invites to WhatsApp groups.”

A spokesperson for WhatsApp confirmed that publicly posted invite links would be available to other WhatsApp users, and insists folks should not have to worry their private invites may be made public in this way. On the other hand, Google’s public search liaison seemed to place the blame squarely on WhatsApp. He tweets:

“Search engines like Google & others list pages from the open web. That’s what’s happening here. It’s no different than any case where a site allows URLs to be publicly listed. We do offer tools allowing sites to block content being listed in our results.”
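The “tools” the liaison mentions are the standard robots.txt and noindex mechanisms. From the crawler’s side, the check looks like the sketch below. Whether chat.whatsapp.com served a restrictive robots.txt at the time is precisely what was at issue, and the invite code in the example is made up.

```python
# Sketch: how a polite crawler decides whether an invite page may be indexed.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://chat.whatsapp.com/robots.txt")
rp.read()  # fetch and parse the file, if the site serves one

invite = "https://chat.whatsapp.com/AbCdEfGh123"  # made-up invite code
print(rp.can_fetch("Googlebot", invite))  # False only if robots.txt forbids it
```

A page-level noindex directive works too, but either way the site, not the search engine, has to serve it.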

Perhaps both companies could have handled this issue with more consideration. We wonder whether WhatsApp has since taken advantage of those content-blocking tools.

Cynthia Murrell, March 11, 2020

Import.io and Connotate: One Year Later

March 3, 2020

There has been an interesting shift in search and content processing. Import.io, founded in 2012, purchased Connotate. Before you ask, “Connotate what?”, let me say that Connotate was a content scraping and analysis firm. I paid some attention to Connotate when it acquired Fetch, an outfit with an honest-to-goodness Xoogler on its team. Fetch processed structured data, and Connotate was mostly an unstructured data outfit. I asked a Connotate professional when the company would process Dark Web content, only to be told, “We can’t comment on that.” Secretive, right.

Connotate was founded in 2000 and required about $25 million in funding. The amount Import.io paid was not revealed in any source to which DarkCyber has access. Import.io itself has ingested about $38 million. DarkCyber assumes that the stakeholders are confident that 1 + 1 will equal 3 or more.

Import.io says:

We are funded by some of the greatest minds in technology.

The great minds include AME Cloud Ventures, Open Ocean, IP Group, and several others.

The company explains:

Starting from a simple web data extractor and evolving to an enterprise level solution for concurrently getting data that drives business, industry, and goodness.

What does the company provide? The answer is Web data integration: identify, extract, prepare, integrate, and consume content from a user-provided list of urls. To illustrate the depth of the company’s capabilities, Import.io defines “prepare” this way:

Integrate prepared data with a library of APIs to support seamless integration with internal business systems and workflows or deliver it to any data repository to develop robust data sets for advanced analytics capabilities.
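In outline, the identify-extract-prepare loop is familiar scraping territory. Below is a minimal sketch using the requests and BeautifulSoup libraries; the CSS selectors are hypothetical, since a real Import.io extractor is configured point-and-click per site.

```python
# Sketch: extract "prepared" records from a user-provided list of urls.
import requests
from bs4 import BeautifulSoup

URLS = ["https://example.com/products"]  # the user-provided list of urls

def extract(url):
    """Identify and extract records from one page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".product"):  # hypothetical selector for this site
        name = item.select_one(".name")
        price = item.select_one(".price")
        if name and price:
            rows.append({"url": url,
                         "name": name.get_text(strip=True),
                         "price": price.get_text(strip=True)})
    return rows

# "Integrate and consume": hand the prepared records to a database or an API.
records = [row for url in URLS for row in extract(url)]
```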

The firm’s Web site makes it clear that it serves the online travel, retail, manufacturing, hedge fund, advisory services, data scientists, analysts, journalists, marketing and product, hospitality, and media producers. These are a mix of sectors and industries, and DarkCyber did not create the grammatically inconsistent listing.

Import.io offers videos which provide some information about one of its important innovations, “interactive extractors.” The idea is to convert script editing into point-and-click choices.

The company is growing. About a year ago, Import.io said that it experienced record sales growth. The company provided a link to its Help Center, but a number of panels contained neither information nor links to content.

The company offers a free version and a premium version. Price quotes are provided by the company.

Like Amplyfi and maybe ServiceMaster, Import.io is a company providing search and content processing with a 21st century business positioning. A new buzzword is needed to convey what Import.io, Amplyfi, and ServiceMaster are providing. DarkCyber believes that these companies are examples of where search and content processing have begun to coalesce.

The question is, “Is acquiring, indexing, and analyzing OSINT content a truck stop or a destination like Miami Beach?”

Worth monitoring the trajectory of the company.

Stephen E Arnold, March 3, 2020

Google: Feeling the Competitive Heat

February 28, 2020

Google, DarkCyber assumes, thought that Microsoft’s decision to rebuild its Edge browser on Chromium (yielding what some call “Credge”) was a victory. “Google Is Now Warning Millions Of Microsoft Edge Users To Switch To Chrome: Here’s Why” tries to explain Googley thinking.

We learn from the capitalist tool:

Google has been found “abusing user agents,” the identifying code that enables websites to identify the browser type and version, to detect and warn Microsoft Edge users visiting the Chrome web store that when it comes to extensions they should switch to Chrome. The reason for the warning is that Microsoft Edge doesn’t integrate with the Safe Browsing protections Google uses to remove threats—so when an extension presents a risk, Google can’t act in the same way to protect users.
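The user agent mechanics are mundane. Chromium-based Edge announces itself with an “Edg/” token in its User-Agent header, so a site can single it out server-side. A sketch, with a trimmed header string:

```python
# Sketch: the kind of user-agent check the article describes.
def is_chromium_edge(user_agent):
    # Chromium-based Edge appends an "Edg/<version>" token to its header.
    return "Edg/" in user_agent

ua = "Mozilla/5.0 ... Chrome/80.0.3987.87 Safari/537.36 Edg/80.0.361.48"
if is_chromium_edge(ua):
    print("Show the 'switch to Chrome' warning")
```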

Is this the only reason?

DarkCyber thinks a bit of context will explain some of the Googley thinking.

Consider Google and Amazon.

Google does not like Amazon, especially since Amazon stepped away from being solely a retailer to offering software services to customers. Google wants some of Amazon’s cloud business, so it is telling retailers to chuck Amazon and check out Google’s tools. ZDNet rolls out the gossip in the article, “Google’s Pitch To Retailers: We’ll Help You, From Search To Supply Chain.”

At the National Retail Federation, Google introduced retailers to a new line of tools available via its cloud. The tools range from product discovery to supply chain optimization to hybrid application management. Thomas Kurian, Google Cloud CEO, explained that the retailers who innovate with their business plans are the most successful. Google wants to grab these forward-thinking retailers with new tools like:

Among the new offerings for retailers is a new tool called Google Cloud Search for Retail, which Google is piloting now and will introduce to the broader market throughout the year. The tool helps retailers improve search results for their own websites and mobile apps using cloud AI and Google Search algorithms.

Kurian’s blog post also served as a reminder to retailers that they can buy Google Ads to surface their products when customers use Google’s many consumer tools like Search, YouTube, Shopping, Google Assistant or Maps.

We noted:

Google also announced Google Cloud 1:1 Engagement for Retail, a set of best practices that can help retailers build data-driven strategies for personalized customer services. This should make it easier for customers to use Google’s BigQuery data analytics platform to build personalization and recommendation models.

That is just the beginning! Google is also developing a Buy Optimization and Demand Forecasting service to assist retailers in planning and managing supply chains. There is also a new retail version of Anthos, Google’s platform for managing services on site or in the cloud. It will allow retailers to roll out and manage applications across all stores.

What happens if we try to add 1 + 1? DarkCyber thinks the task reveals several facets of Googley thinking:

  1. Microsoft has lots of Windows 10 users who just use Credge. The browser works, it is there, so why hunt for a different way to look at Web pages? But what if Credge gets traction on mobile phones? What if Microsoft, the long-time drone target of the Google, gets eyeballs on Android devices? Yep, those victory cheers are likely to become verbal and physical tics.
  2. Amazon is selling ads. Selling lots of ads is not good for the Google, a company which for more than 20 years has had one revenue stream of significance. The Google wants to put some sand in the fuel tank of the Bezos bulldozer. Thus, Googley behavior dictates action.
  3. Google itself faces a problem few companies have: indexing the Web and its ever-changing pages is expensive. How does one cut costs when Microsoft may blindly wreak havoc in the browser revenue flow or put a dent in the quite robust mobile ad business? How does Google protect existing ad revenue and possibly cause the Bezos bulldozer to go down for an engine overhaul? Aggressive action seems to be the order of the day.

If DarkCyber steps back or, in the lingo of a University of Chicago philosopher, goes “up a meta level,” the actions of today’s Googlers reflect some changes which may give pause. Marketing has never been a Googley strength. Now it has to be competitive marketing. Is Google’s marketing elegant? Yeah, not too elegant.

Can Google control costs without further compromising its search service, its wonky innovations, and its increasingly contentious employee-management interactions?

DarkCyber finds the Credge and Bezos bulldozer “plays” interesting and entertaining.

Stephen E Arnold, February 27, 2020

Betting $11 Million That Content Processing Can Be Fixed

February 13, 2020

The Semantic Web, data lakes, data ponds, dark data, federated information, natural language processing — you have heard the buzzwords for years. The solution? MarkLogic, IBM (Data Fountain, OmniFind, Vivisimo, or Watson), social graph outfits like CluedIn, and Google’s Ramanathan Guha inventions. What about Kapow? And there are others, hundreds maybe.

Nevertheless, making sense of oceans of digital information is a bit of a task. What MBA-inspired manager asks about document exception folders? Ah, what’s that mean? Just delete them because no one wants to explain. It is Foosball time.

“AI Document Engineering Startup Docugami Raises $10M Seed Round in Unusually Large Early Stage Deal” reports some interesting information; for example:

  • Some former Microsofties did not gain traction at the Amazon-chasing Redmond firm.

  • Funding sources include an assortment of investment firms, among them SignalFire and NextWorld Capital.

  • There are some people with links to the Google.

What does Docugami seek to do? The article states:

The startup’s technology uses artificial intelligence to help users create documents such as contracts and reports that can then be analyzed in the aggregate as if the contents were stored in a structured database.

Okay, smart software, machine learning, computer vision, and “unique XML approaches.”

The millions indicate that company founder Jean Paoli (who had his fingers on the keyboard cranking out the XML standard) can tell a heck of a story. The official word for this craft is “creating a narrative.”

The most interesting factoid in the write up is the multiple references to InfoPath. As you may know, InfoPath appeared in Office 2003 and disappeared in 2014. Like many Microsoft ideas, filling in the blanks — like filling out a form to get work at Wendy’s — is a logical way to get users to generate structured data. Yeah, well. InfoPath is still around, and there are some rah rah users, but support officially ends in 2026. (Some of those users like forms and spend lots of money for SharePoint and other Microsoft works in progress.)

What happened to InfoPath other than not becoming the next Azure super service? XML and structured data for information in email, note apps, Excel files (which let analysts write their reports in a spreadsheet), and other Microsoft products were not a home run. That’s one problem, and the idea now is to let smart software apply structure, assign index terms, extract named entities, and perform “knowledge extraction.” Sounds easy. Yeah, well.
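The underlying idea, in both InfoPath’s forms and, apparently, Docugami’s smart software, is to turn free text into queryable XML. Here is a deliberately crude sketch; the regexes and tag names are invented, and Docugami’s actual method is not public beyond the article’s description.

```python
# Sketch: impose structure on a free-text contract by extracting fields into XML.
import re
import xml.etree.ElementTree as ET

def contract_to_xml(text):
    root = ET.Element("contract")
    date = re.search(r"\b\w+ \d{1,2}, \d{4}\b", text)  # e.g. "January 5, 2020"
    parties = re.findall(r"between (.+?) and (.+?)[,.]", text)
    if date:
        ET.SubElement(root, "effective_date").text = date.group()
    for a, b in parties:
        ET.SubElement(root, "party").text = a
        ET.SubElement(root, "party").text = b
    return ET.tostring(root, encoding="unicode")

print(contract_to_xml(
    "This agreement, made January 5, 2020, between Acme Corp and Widget LLC, covers..."))
```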

But the federation issue has some other facets, and it is not clear if the Docugami approach will solve these; for example:

  • Does a company want software to have access to content which may be confidential, incriminating, or restricted by law or common sense (that new drug in trial seems to be killing people so let’s not index that)?
  • How does a content and indexing system deal with the wild and crazy information on the Internet? Some of that information may be important in litigation, competitive intelligence, and personal idiosyncrasies like comments added to certain interesting social media content.
  • What happens when copyrighted material is sucked into the Docugami digital weather system? What happens when pornographic, drug related, and other information of a possible criminal nature is indexed along with those human resource salary data and the actual earnings data on the CFO’s computing device?
  • Where will the content reside? What’s the cost for storage, transmission, updating, and flagging “incorrect” data?

For quite specific types of content, InfoPath and probably Docugami make sense.

But the narrative may be more important than the word painting which describes a world in which information is at one’s fingertips.

Is DarkCyber skeptical? Not at all. There is insufficient information at this time to determine if those millions are bet on a potential Kentucky Derby winner or a creature who will spend its life carrying kids around a dude ranch’s pony ride.

Stephen E Arnold, February 13, 2020

AWS AI Improves Its Accuracy According to Amazon

January 31, 2020

An interesting bit of jargon creeps into “On Benchmark Data Set, Question-Answering System Halves Error Rate.” That word is “transfer.” Amazon, it seems, is trying to figure out how to reuse data, threshold settings, and workflow outputs.

Think about IBM’s Deep Blue defeat of Garry Kasparov in 1997 or the IBM Watson thing allegedly defeating Ken Jennings in 2011 without any help from post production or judicious video editing. Two IBM systems and zero “transfer” or, in more Ivory Towerish jargon, “transference.”

Humans learn via transfer. Artificial intelligence, despite the marketers’ assurances, does not transfer very well. One painful and expensive fact of life which many venture funding outfits ignore is that most AI innovations start from ground zero for each new application of a particular AI technology mash up.

Imagine if Deep Blue had been able to transfer its “learnings” to Watson. IBM might have avoided becoming a poster child for inept technology marketing. Watson is now a collection of software modules, but these don’t transfer particularly well. Hand crafting, retraining, testing, tweaking, and tuning are required, and then must be reapplied as data drift causes “accuracy” scores to erode like a 1971 Vega.

Amazon suggests that it is making progress on the smart software transference challenge. The write up states:

Language models can be used to compute the probability of any given sequence (even discontinuous sequences) of words, which is useful in natural-language processing. The new language models are all built atop the Transformer neural architecture, which is particularly good at learning long-range dependencies among input data, such as the semantic and syntactic relationships between individual words of a sentence.

DarkCyber has dubbed some of these efforts as Bert and Ernie exercises, but that point of view is DarkCyber’s, not the views of those with skin in the AI game.

Amazon adds:

Our approach uses transfer learning, in which a machine learning model pretrained on one task — here, word sequence prediction — is fine-tuned on another — here, answer selection. Our innovation is to introduce an intermediate step between the pre-training of the source model and its adaptation to new target domains.

Yikes! A type of AI learning. The Amazon approach is named TANDA (transfer and adapt), not Ernie thankfully. Here’s a picture of how TANDA works:

[Diagram: the TANDA transfer-and-adapt workflow]

The write up reveals more about how the method functions.
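The two-stage recipe in the quoted passage can be sketched in a few lines of PyTorch. Everything here is a toy stand-in: random tensors instead of QA corpora, a two-layer network instead of a pretrained Transformer. The point is the order of operations, a “transfer” fine-tune followed by an “adapt” fine-tune on the same weights, not Amazon’s implementation.

```python
# Sketch: the transfer-and-adapt (TANDA) fine-tuning order, with toy data.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU())  # stand-in for a pretrained Transformer
head = nn.Linear(256, 2)  # answer selection: relevant / not relevant
model = nn.Sequential(encoder, head)

def fine_tune(model, features, labels, epochs=3, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        opt.step()

# Step 1, "transfer": a large, general answer-selection corpus (random stand-in).
fine_tune(model, torch.randn(512, 768), torch.randint(0, 2, (512,)))
# Step 2, "adapt": a small domain-specific corpus; the same weights carry over.
fine_tune(model, torch.randn(64, 768), torch.randint(0, 2, (64,)), lr=1e-5)
```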

The key part of the write up, in DarkCyber’s opinion, is the “accuracy” data; to wit:

On WikiQA and TREC-QA, our system’s MAP was 92% and 94.3%, respectively, a significant improvement over the previous records of 83.4% and 87.5%. MRR for our system was 93.3% and 97.4%, up from 84.8% and 94%, respectively.
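For readers who want to know what the quoted figures measure, MRR is simple arithmetic: average, over all questions, one divided by the rank of the first correct answer. A toy computation:

```python
# Sketch: mean reciprocal rank over toy rankings.
def mean_reciprocal_rank(rankings):
    """Each inner list marks ranked candidate answers as correct/incorrect."""
    total = 0.0
    for ranked in rankings:
        for i, correct in enumerate(ranked, start=1):
            if correct:
                total += 1.0 / i
                break
    return total / len(rankings)

# First correct answers at ranks 1 and 2 -> (1.0 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([[True, False], [False, True, False]]))
```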

If true, Amazon has just handed a problem to Google, Microsoft, and the others working to reduce the costs of training machine learning systems and delivering many wonderful services.

Most smart systems are fortunate to hit 85 percent accuracy in carefully controlled lab settings. Amazon is nosing into an accuracy range few humans can consistently deliver when indexing, classifying, or identifying if a picture that looks like a dog is actually a dog.

DarkCyber generally doubts data produced by a single research team. That rule holds for these data. Since the author of the report works on Alexa search, maybe Alexa will be able to answer this question, “Will Amazon overturn Microsoft’s JEDI contract award?”

Jargon is one thing. Real world examples are another.

Stephen E Arnold, January 31, 2020

Calling Out Search: Too Little, Too Late

January 20, 2020

The write up’s title is going to be censored in DarkCyber. We are not shrinking violets, but we think that stop word lists do exist. Problem? Buzz your favorite ad supported search vendor and voice your complaints.

The write up is “How Is Search So #%&! Bad? A ‘Case Study’.” The author appears to be frustrated with the outputs of ad supported and probably other types of seemingly “free” search systems providing links to Web content. This is what some people call “open source intelligence online.” There are other information resources available, but most of the consumer-oriented, eyeball-hungry vendors ignore i2p, forums with minimal traffic, what some experts call the Dark Web, and even some government information services. How many people pay any attention to the US National Archives? Be honest in your assessment.

Here’s a passage we noted:

Google Search is ridiculously, utterly bad.

This seems clear.

The write up provides some examples, but I anticipate that some other people have found that the connection between a user’s query and the Google search outputs is tenuous at best. One criticism DarkCyber has of the write up is that it mentions Google, shifts to Reddit, and then to metadata. The key point for us was the focus on time.

Now time is an interesting issue in indexing. Years ago I did a research project on the “meaning” of “real time” in online services. I think my research team identified five or six different types of time. I will skip the nuances we identified and focus only on the date or freshness of an item in a results list.

Let’s be sympathetic to the indexing company. Here’s why:

First, many documents do not provide an explicit date in the text of the article. In Beyond Search and DarkCyber, you will notice that we provide the author’s name and the day and date at which the article was posted. Many write ups on the open Web don’t bother. In fact, there is often no easy way to date the time the author posted the story from the content displayed in a browser. Don’t you love news releases which do not include a date, time, and time zone?

Second, many write ups include dates and times in the text of an article. For example, a reference to Day 2 of the recent CES trade show may stand in for the explicit date January 8, 2020, for a product announcement. The approach is similar to using CES without spelling out “Consumer Electronics Show.” But, hey, these folks are busy, and everyone in the know understands the what and when, right?

Third, auto-assigned dates by operating systems may be “correct” when a file or content object is created. But what happens when a file or drive is restored? The original dates and metadata may be replaced with the time stamp of the restore. What about date last accessed or date last changed? Too much detail. Yada yada.

Fourth, time sorting is possible. Google invested in Recorded Future (now part of Insight). I had heard that someone at the GOOG thought Recorded Future’s time functions were nifty. Guess not. Google did not implement more sophisticated time functions in any service other than those related to advertising. For the great unwashed masses who don’t work at Google, tough luck, I suppose.

Fifth, when was the content first indexed? More significantly, when was the content last updated? Important? Maybe, gentle reader. Maybe.

There are several other conditions as well. For the purposes of a blog post, I want to make clear: The person who is annoyed with search should have been annoyed decades ago. These time problems are not new, and they are persistent.
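A small illustration of why the freshness complaint is hard to fix: one article can carry several defensible dates, and a “past year” filter gives a different verdict depending on which date the indexer trusts. The values below are invented.

```python
# Sketch: four plausible dates for the same document, four different verdicts.
from datetime import datetime, timezone

candidate_dates = {
    "http_last_modified": datetime(2020, 1, 18, tzinfo=timezone.utc),   # a restore?
    "meta_published": datetime(2011, 10, 27, tzinfo=timezone.utc),
    "in_text_mention": datetime(2020, 1, 8, tzinfo=timezone.utc),       # "Day 2 of CES"
    "first_crawled": datetime(2011, 11, 2, tzinfo=timezone.utc),
}

cutoff = datetime(2019, 1, 20, tzinfo=timezone.utc)  # a "past year" filter
for source, when in sorted(candidate_dates.items(), key=lambda kv: kv[1]):
    print(f"{source}: {when.date()} -> {'fresh' if when >= cutoff else 'stale'}")
```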

The author with a penchant for tardy profanity stated:

Part of the issue in this specific case is that they’ve started ignoring settings for displaying results from specific time periods. It’s definitely not the whole issue though, and not something new or specific to phone searches. Now, I’ve always been biased towards the new – books, tech, everything, but I can’t help but feel that a lot of things which were done pretty well before are done worse today. We do have better technology, yet we somehow build inferior solutions with it all too often. Further, if they had the same bias of showing me only recent results I’ll understand it better, but that’s not even the case. And yes, I get that the incentives of users and providers don’t align perfectly, that Google isn’t your friend, etc. But what is DDG’s excuse? As for the Case Study part, and me saying this isn’t simply a rant – I lied, hence the quotation marks in the title. Don’t trust everything you read, especially the goddamn dates on your search results.

The write up omits a few other minor problems with modern search and retrieval systems. Yep, this includes Reddit, LinkedIn, and a bunch of others. Let me provide a few dot points:

  • Poorly implemented Boolean search
  • Zero information about what’s in an index
  • Zero information about what’s excluded from an index and why
  • Minimal auto linking to information about an “author” or the “source” of the content
  • No data to make a precision or recall calculation possible and reproducible (see the sketch after this list)
  • No data to make it possible to determine overlap among Web indexes. Analyses must be brute forced. Due to the volatility, latency, and editorial vagaries of ad supported Web search systems, data are mostly suggestive.
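Here is the sketch promised above. The precision and recall arithmetic is trivial; what no public Web engine supplies is the inputs, above all the full set of relevant documents. The result sets below are invented.

```python
# Sketch: the arithmetic a reproducible precision/recall claim would need.
def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc7"}  # requires knowing the whole collection
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67

# Overlap between two engines' indexes (the last dot point) is the same set
# arithmetic -- if you could get the sets.
engine_a, engine_b = {"doc1", "doc2"}, {"doc2", "doc5"}
print(len(engine_a & engine_b) / len(engine_a | engine_b))  # Jaccard overlap
```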

Why? Why are none of these dot points operative?

Answer: Too expensive, too hard, not appropriate for our customers, and “What are you talking about? We never heard of half these issues you identified.”

Net net: Years ago I wrote an article for Searcher Magazine, edited at the time by Barbara Quint, a bit of an expert in online information retrieval. She worked at RAND for a number of years as an information expert. She asked, “Do you really want me to use the title ‘Search Sucks’ on your article?” I told her to use whatever title she wanted, but if she agreed with me, to go with “sucks.” She used “sucks.” Let’s see, that was a couple of decades ago.

Did anyone care? Nope. Does anyone care today? Nope. There you go.

Stephen E Arnold, January 20, 2020

A Taxonomy Vendor: Still Chugging Along

January 15, 2020

Semaphore Version 5 from Smartlogic coming soon.

An indexing software company — now morphed into a semantic AI outfit — Smartlogic promises Version 5 of its enterprise platform, Semaphore, will be available any time now.

The company modestly presents the announcement below the virtual fold in the company newsletter, “The Semaphore—Smartlogic’s Quarterly Newsletter—December 2019.” The General Access release should be out by the end of January. We’re succinctly informed because in indexing succinct is good:

“Semaphore 5 embodies innovative technologies and strategies to deliver a unified user experience, enhanced interoperability, and flexible integration:

  • A single platform experience: modules are tightly integrated.

  • Intuitive and simplified installation and administration: software can be downloaded and configured with minimal clicks. An updated landing page allows you to quickly navigate modules and monitor status.

  • Improved coupling of classification and language services, as well as improved performance.

  • An updated linguistic model and fact extraction capabilities.

  • New: Document Semantic Analyzer, a performant content analyzer that provides detailed classification and language services results.

  • New branding that aligns modules with capabilities and functionality.

“Semaphore 5 continues to focus around 3 core areas – Model & collaborate; fact extraction, auto-classification & language services; and integrate & visualize – in a modular platform that allows you to add capabilities as your business needs evolve. As you upgrade to Semaphore 5, you will be able to take advantage of the additional components and capabilities incorporated in your licensed modules.”
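For readers new to the category, here is what auto-classification against a taxonomy amounts to in miniature. Semaphore builds its models from curated ontologies; this keyword-weight stand-in is invented solely to show the shape of the task.

```python
# Sketch: score a document against taxonomy nodes; tag it above a threshold.
TAXONOMY = {
    "Finance/Risk": {"exposure": 2.0, "liquidity": 2.0, "default": 1.5},
    "Pharma/Trials": {"placebo": 2.0, "cohort": 1.5, "efficacy": 1.5},
}

def classify(text, threshold=2.0):
    tokens = text.lower().split()
    scores = []
    for node, terms in TAXONOMY.items():
        score = sum(w for term, w in terms.items() if term in tokens)
        if score >= threshold:
            scores.append((node, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)

print(classify("The cohort showed efficacy against placebo in week six"))
# -> [('Pharma/Trials', 5.0)]
```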

Semaphore is available on-premise, in the cloud, or in a hybrid combination. Smartlogic (not to be confused with the custom app company Smartlogic) was founded in 2006 and is based in San Jose, California. The company owns SchemaLogic. Yep, we’re excited too. Maybe NLP, predictive analytics, and quantum computing technology will make a debut in this release. If not in software, perhaps in the marketing collateral?

Cynthia Murrell, January 15, 2020
