AI Tech Used to Index and Search Joint Pathology Center Archive

November 23, 2020

The US’s Joint Pathology Center is the proud collector of the world’s largest group of preserved human tissue samples. Now, with help from University of Waterloo’s KIMIA Lab in Ontario, Canada, the facility will soon be using AI to index and search its digital archive of samples. ComputerUser announces the development in, “Artificial Intelligence Search Technology Will Be Used to Help Modernize US Federal Pathology Facility.”

As happy as we are to see the emergence of effective search solutions, we are also tickled by the names KIMIA used: the image search engine is commercialized under the name Lagotto, and the image retrieval tech is dubbed Yottixel. The write-up tells us:

“Yottixel will be used to enhance biomedical research for infectious diseases and cancer, enabling easier data sharing to facilitate collaboration and medical advances. The JPC is the leading pathology reference centre for the US federal government and part of the US Defense Health Agency. In the last century, it has collected more than 55 million glass slides and 35 million tissue block samples. Its data spans every major epidemic and pandemic, and was used to sequence the Spanish flu virus of 1918. It is expected that the modernization also helps to better understand and fight the COVID-19 pandemic. … Researchers at Waterloo have obtained promising diagnostic results using their AI search technology to match digital images of tissue samples in suspected cancer cases with known cases in a database. In a paper published earlier this year, a validation project led by Kimia Lab achieved accurate diagnoses for 32 kinds of cancer in 25 organs and body parts.”

Short for the Laboratory for Knowledge Inference in Medical Image Analysis, KIMIA Lab focuses on mass image data in medical archives using machine learning schemes. Established in 2013 and hosted by the University of Waterloo’s Faculty of Engineering, the program trains students and hosts international visiting scholars.
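The quoted passage describes matching an image from a suspected case against known cases in a database. Below is a minimal sketch of that general idea, not Yottixel itself. It assumes each image has already been reduced to a numeric feature vector; the archive entries, vectors, labels, and function names are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_archive(query_vector, archive, top_k=3):
    """Return the top_k most similar archived cases and the majority label among them."""
    ranked = sorted(
        archive,
        key=lambda case: cosine_similarity(query_vector, case["vector"]),
        reverse=True,
    )
    hits = ranked[:top_k]
    votes = Counter(case["label"] for case in hits)
    return hits, votes.most_common(1)[0][0]


# Hypothetical archive: in a real system the vectors would come from an image model.
ARCHIVE = [
    {"id": "case-001", "vector": [0.9, 0.1, 0.0], "label": "type A"},
    {"id": "case-002", "vector": [0.8, 0.2, 0.1], "label": "type A"},
    {"id": "case-003", "vector": [0.1, 0.9, 0.2], "label": "type B"},
    {"id": "case-004", "vector": [0.0, 0.8, 0.3], "label": "type B"},
]

hits, suggestion = search_archive([0.85, 0.15, 0.05], ARCHIVE)
print([case["id"] for case in hits], suggestion)
```

The majority vote among the nearest matches is one simple way to turn a search result into a suggestion; the article does not say how the Waterloo system does it.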

Cynthia Murrell, November 23, 2020

Defeating Facial Recognition: Chasing a Ghost

August 12, 2020

The article hedges. Check the title: “This Tool could Protect Your Photos from Facial Recognition.” Notice the “could”. The main idea is that people do not want their photos analyzed and indexed with the name, location, state of mind, and other index terms. I am not so sure, but the write up explains with that “could” coloring the information:

The software is not intended to be just a one-off tool for privacy-loving individuals. If deployed across millions of images, it would be a broadside against facial recognition systems, poisoning the accuracy of the data sets they gather from the Web.

So facial recognition = bad. Screwing up facial recognition = good.

There’s more:

“Our goal is to make Clearview go away,” said Dr Ben Zhao, a professor of computer science at the University of Chicago.

Okay, a company is a target.

How’s this work:

Fawkes converts an image — or “cloaks” it, in the researchers’ parlance — by subtly altering some of the features that facial recognition systems depend on when they construct a person’s face print.
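To make "subtly altering features" concrete, here is a toy sketch of the general idea: nudge pixel values by a small, bounded amount so that the extracted features drift toward a decoy. This is not the Fawkes implementation. The linear "feature extractor," the decoy target, the four-pixel "photo," and every name below are assumptions for illustration only.

```python
def extract_features(weights, pixels):
    """Toy stand-in for a face feature extractor: a fixed linear map."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cloak(pixels, target, weights, epsilon=0.03, steps=50, learning_rate=0.1):
    """Projected gradient descent: move features toward `target`
    while changing no pixel by more than `epsilon`."""
    delta = [0.0] * len(pixels)
    for _ in range(steps):
        shifted = [p + d for p, d in zip(pixels, delta)]
        residual = [f - t for f, t in zip(extract_features(weights, shifted), target)]
        # Gradient of the squared feature distance with respect to each pixel.
        gradient = [
            2.0 * sum(row[i] * r for row, r in zip(weights, residual))
            for i in range(len(pixels))
        ]
        # Take a step, then clip the change back into the allowed budget.
        delta = [
            max(-epsilon, min(epsilon, d - learning_rate * g))
            for d, g in zip(delta, gradient)
        ]
    # Keep pixel values in the valid [0, 1] range.
    return [min(1.0, max(0.0, p + d)) for p, d in zip(pixels, delta)]


# Hypothetical data: a four-pixel "photo" and a decoy feature vector.
WEIGHTS = [[0.5, -0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
PHOTO = [0.2, 0.4, 0.6, 0.8]
DECOY = [0.1, 0.5]

CLOAKED = cloak(PHOTO, DECOY, WEIGHTS)
print("largest pixel change:", max(abs(c - p) for c, p in zip(CLOAKED, PHOTO)))
print("distance to decoy before:", squared_distance(extract_features(WEIGHTS, PHOTO), DECOY))
print("distance to decoy after:", squared_distance(extract_features(WEIGHTS, CLOAKED), DECOY))
```

The point of the sketch is the trade-off: the pixel budget keeps the change hard to see, while the features a recognizer relies on still shift.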

Several observations:

  • In the event of a problem like the explosion in Lebanon, maybe facial recognition can identify some of those killed.
  • Law enforcement may find narrowing a pool of suspects to a smaller group may enhance an investigative process.
  • Unidentified individuals who are successfully identified “could” add precision to Covid contact tracking.
  • Applying the technology to differentiate “false” positives from “true” positives in some medical imaging activities may be helpful in some medical diagnoses.

My concern is that technical write ups are often little more than social polemics. Examining the upside and downside of an innovation is important. Converting a technical process into a quest to “kill” a company, a concept, or an application of technical processes is not helpful in DarkCyber’s view.

Stephen E Arnold, August 12, 2020

Twitter: Another Almost Adult Moment

August 7, 2020

Indexing is useful. Twitter seems to be recognizing this fact. “Twitter to Label State-Controlled News Accounts” reports:

The company will also label the accounts of government-linked media, as well as “key government officials” from China, France, Russia, the UK and US. Russia’s RT and China’s Xinhua News will both be affected by the change. Twitter said it was acting to provide people with more context about what they see on the social network.

Long overdue, the idea of an explicit index term may allow some tweeters to get some help when trying to figure out where certain stories originate.

Twitter, a particularly corrosive social media system, has avoided adult actions. The firm’s security was characterized in a recent DarkCyber video as a clown car operation. No words were needed. The video showed a clown car.

Several questions from the DarkCyber team:

  1. When will Twitter verify user identities, thus eliminating sock puppet accounts? Developers of freeware manage this type of registration and verification process, not perfectly but certainly better than some other organizations’.
  2. When will Twitter recognize that a tiny percentage of its tweeters account for the majority of the messages and implement a Twitch-like system to generate revenue from these individuals? Pay-per-use can be implemented in many ways, so can begging for dollars. Either way, Twitter gets an identification point which may have other functions.
  3. When will Twitter innovate? The service is valuable because a user or sock puppet can automate content regardless of its accuracy. Twitter has been the same for a number of Internet years. Dogs do age.

Is Twitter, for whatever reason, stuck in the management mentality of a high school science club which attracts good students, just not the whiz kids who are starting companies and working for Google type outfits from their parents’ living room?

Stephen E Arnold, August 7, 2020

NLP with an SEO Spin

July 8, 2020

If you want to know how search engine optimization has kicked librarians and professional indexers in the knee and stomped on their writing hand, you will enjoy “Classifying 200,000 Articles in 7 Hours Using NLP.” The write up makes clear that human indexers are going to become the lamp lighters of the 21st century. Imagine. No libraries, no subject matter experts curating and indexing content, no human judgment. Nifty. Perfect for a post Quibi world.

The write up explains the indexing methods of one type of smart software. The passages below highlight the main features of the method:

Weak supervision: the human annotator explains their chosen label to the AI model by highlighting the key phrases in the example that helped them make the decision. These highlights are then used to automatically generate nuanced rules, which are combined and used to augment the training dataset and boost the model’s quality.

Uncertainty sampling: it finds those examples for which the model is most uncertain, and suggests them for human review.

Diversity sampling: it helps make sure that the dataset covers as diverse a set of data as possible. This ensures the model learns to handle all of the real-world cases.

Guided learning: it allows you to search through your dataset for key examples. This is particularly useful when the original dataset is very imbalanced (it contains very few examples of the category you care about).

These phrases may not be clear. May I elucidate:

  • Weak supervision. Subject matter experts riding herd. No way. Inefficient and not optimizable.
  • Uncertainty sampling means a “fudge factor” or “fuzzifying.” A metaphor might be “close enough for horse shoes.”
  • Guided learning. Yep, manual assembly of training data, recalibration, and more training until the horse shoe thing scores a point.
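For readers who want the horse shoe spelled out, here is a minimal sketch of uncertainty sampling as the quoted passage defines it: pick the examples the model is least sure about and hand them to a human. It assumes a binary classifier that returns a probability for each document; the document IDs, the probabilities, and the function names are hypothetical and not taken from the article.

```python
def uncertainty(probability):
    """Score for a binary classifier: 1.0 when p = 0.5, 0.0 when p is 0 or 1."""
    return 1.0 - abs(probability - 0.5) * 2.0


def select_for_review(predictions, batch_size=2):
    """Return the IDs of the examples the model is least sure about."""
    ranked = sorted(predictions, key=lambda item: uncertainty(item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:batch_size]]


# Hypothetical model output: (document id, probability the article is "on topic").
PREDICTIONS = [
    ("doc-1", 0.97),
    ("doc-2", 0.52),
    ("doc-3", 0.08),
    ("doc-4", 0.41),
    ("doc-5", 0.70),
]

print(select_for_review(PREDICTIONS))
```

The documents scored near 0.5 go to the human; the ones the model is sure about, rightly or wrongly, do not.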

The write up undermines its good qualities with a reference to Google. Has anyone noticed that Google’s first page of results for most of my queries consists of advertisements?

NLP and horse shoes. Perfect match. Why are the index and classification codes those which an educated person would find understandable and at hand? Forget answering this question. Just remember good enough and close enough for horse shoes. Clang and kha-ching as another ad sucks in a bidder.

Stephen E Arnold, July 8, 2020

Smartlogic: Making Indexing a Thing

May 29, 2020

Years ago, one of the wizards of Smartlogic visited the DarkCyber team. The group numbered about seven of my loyal researchers. These were people who had worked on US government projects, analyses for now disgraced banks in NYC, and assorted high technology firms. Was the world’s largest search system in this list? Gee, I don’t recall.

In that presentation, Smartlogic’s wizard explained that indexing, repositioned as tagging, was important. The wizard offered examples of the value of metatagging (presumably a more advanced form of the 40 year old classification codes used in the ABI/INFORM database since, what, 1983). Smartlogic embarked on a mini acquisition spree, purchasing the interesting Schemalogic company about a decade ago.

What did Schemalogic do? In addition to being a wonderland for Windows Certified Professionals, the “server” managed index terms. The idea was that people in different departments assigned key words to different digital entities; for example, an engineer might assign the key word “E12.” This is totally clear to a person who thinks about resistors, but to a Home Economics graduate working in marketing the E12 was a bit of a puzzle. The idea that an organization in the pre Covid days could develop a standard set of tags is a fine idea. There are boot camps and specialist firms using words like taxonomy or controlled terms in their marketing collateral. However, humans are not too good at assigning terms. Humans get tired and fall back upon their faves. Other humans are stupid, bored, or indifferent and just assign terms and be done with it. Endeca’s interesting Guided Navigation worked because the company cleverly included consulting in a license. The consulting consisted of humans who worked up the needed vocabulary for a liquor store or preferably an eCommerce site with a modest number of products for sale. (There are some computational challenges inherent in the magical Endeca facets.)

Consequently massive taxonomy projects come and then fade. A few stick around, but these are often hooked to applications with non volatile words. The Engineering Index is a taxonomy, but its terminology is of scant use to an investment bank. How about a taxonomy for business? ABI/INFORM created, maintained, and licensed its vocabulary to outfits like the Royal Bank of Canada. However, ABI/INFORM moved into the hands of the brilliant managers at other firms. I assume a biz dev professional at whatever owner possesses rights to the vocabulary will cut a deal.

Back to Smartlogic.

Into this historical stew, Smartlogic offered a better way. I think that was the point of the marketing presentation we enjoyed years ago. Today the words have become more jargon centric, but the idea is the same: Index in a way that makes it possible to find that E12 when the vocabulary of the home ec major struggles with engineer-speak.
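The E12 problem can be sketched in a few lines. This is a minimal illustration of a controlled vocabulary that maps department-specific key words to one preferred term. It is not how Schemalogic or Smartlogic implement anything. The preferred term, the variants, the file names, and the function names are hypothetical.

```python
# Hypothetical controlled vocabulary: preferred term -> variants used in different departments.
CONTROLLED_VOCABULARY = {
    "resistor value series": {"e12", "e12 series", "resistor sizes"},
}


def build_lookup(vocabulary):
    """Map every variant (and the preferred term itself) to the preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


def normalize(tag, lookup):
    """Return the preferred term for a tag, or the cleaned tag if it is not in the vocabulary."""
    cleaned = tag.strip().lower()
    return lookup.get(cleaned, cleaned)


def build_index(documents, lookup):
    """Index documents under normalized terms instead of whatever each person typed."""
    index = {}
    for doc_id, tags in documents.items():
        for tag in tags:
            index.setdefault(normalize(tag, lookup), set()).add(doc_id)
    return index


def search(term, index, lookup):
    return sorted(index.get(normalize(term, lookup), set()))


# Hypothetical documents tagged by an engineer and by marketing.
DOCUMENTS = {
    "spec-sheet.pdf": ["E12"],
    "marketing-brief.docx": ["resistor sizes"],
    "press-release.docx": ["product launch"],
}

LOOKUP = build_lookup(CONTROLLED_VOCABULARY)
INDEX = build_index(DOCUMENTS, LOOKUP)
print(search("resistor sizes", INDEX, LOOKUP))
print(search("E12", INDEX, LOOKUP))
```

Both queries return the same two documents. The code is the easy part; deciding what goes into the vocabulary is where the humans, and the consulting fees, come in.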

Our perception evolved. Smartlogic dabbled in the usual markets. Enterprise search vendors pushed into customer support. Smartlogic followed. Health and medical outfits struggled with indexing content and medical claims forms. Indexing specialists followed the search vendors. Smartlogic has enthusiastically chased those markets as well. An exit for the company’s founders has not materialized. The dream of many, a juicy IPO, must work through the fog of the post Covid business world.

The most recent pivot is announced this way:


Smartlogic now offers indexing for these sectors expressed in what may be Smartlogic compound controlled terms featuring conjunctions. There you go, Bing, Google, Swisscows, Qwant, and Yandex. Parse these and then relax the users’ query. That’s what happens to well considered controlled terms today, DarkCyber knows.

  • Energy and utilities
  • Financial services and insurance
  • Health care
  • High tech and manufacturing
  • Media and publishing
  • Life sciences
  • Retail and consumer products
  • and of course, intelligence (presumably business, military, competitive, and enforcement).

Is the company pivoting or running a Marketing 101 game plan?

DarkCyber noted that Smartlogic offers a remarkable array of services, technologies (including our favorites semantic and knowledge management), and — wait for it — artificial intelligence.

Interesting. Indexing is versatile and definitely requires a Swiss Army Knife of solutions, a Gartner encomium, and those pivots. Those spins remain anchored to indexing.

Want to know more about Smartlogic? Navigate to the company’s Web site. There’s even a blog! Very reliable outfit. Quick response. Objective. Who could ask for anything more?

Stephen E Arnold, May 29, 2020

YouTube and Objective Search Results

May 13, 2020

DarkCyber, working from a run down miner’s camp in rural Kentucky, does not understand the outside world. One of the DarkCyber research team who actually graduated from middle school spotted this article: “YouTube CEO Admits Users Don’t Like Boosting Of “Authoritative” Mainstream Channels, But They Do It Anyway.”

The article appears to present information implicating the most popular video service in Eastern Europe, including the “stans,” in some surprising activities.

The article asserts:

YouTube CEO Susan Wojcicki admits that the company knows its users don’t like the video giant rigging its own algorithm to boost “authoritative” mainstream sources, but that they do it anyway.

The article notes:

For several years now, the company has artificially gamed its own search engine to ensure that independent content creators are buried underneath a wall of mainstream media content. This rigging is so severe that the company basically broke its own search engine, with some videos posted by independent creators almost impossible to find even if the user searches for the exact title.

One fascinating connection between the providers of content from Van Wives is:

the company’s disdain for its own user base was also underscored by its Chief Product Officer Neil Mohan insulting non-mainstream YouTube creators as basement-dwelling idiots. This followed a new policy by the company to remove any content that challenged the World Health Organization’s official coronavirus guidelines, despite the fact that those guidelines have changed numerous times.

Here in Kentucky, the world is shaped by individuals walking along empty roads and mostly unused trails in the hills.

When big city information like this reaches the DarkCyber research team, our first instinct is to search Google and YouTube, maybe Google News or the comprehensive Google Scholar indexes. But this write up suggests that the information displayed may be subjective. The team is confused.

The team believes that what appears in the Google search results is accurate.

Sometimes we don’t believe the state’s environmental officer who has recently decided to wear shoes. The information in the hollow is that yellow green water is safe to drink.

Does this person obtain information as we do? A Google search? Are those Google algorithms the digital equivalent of the local grocer who puts his grimy thumb on the scale when weighing kiwano and feijoa? Our grocer tells us that durian smells great too.

Stephen E Arnold, May 13, 2020

No Fooling: Copyright Enforcer Does Indexing Too

April 1, 2020

The Associated Press is one of the oldest, most respected, and widely read news services in the world. As more than half the world reads Associated Press, it makes one wonder how the news service organizes and distributes its content. Synaptica has more details in the article, “Synaptica Insights: Veronika Zielinska, The Associated Press.”

Veronika Zielinska has a background in computational linguistics and natural language. She was interested in how automated tagging, taxonomies, and statistical engines apply rules to content. She joined Associated Press’s Information Management team in 2005, then moved up to the Metadata Technology team. Her current responsibilities include developing the Metadata Services platform, fine tuning search quality and relevancy for content distribution platforms, scheme design, data transformations, analytics and business intelligence programs, and developing content enrichment methods.

Zielinska offers information on how the Associated Press builds a taxonomy:

“We looked at all the content that AP produced and scoped our taxonomy to cover all possible topics, events, places, organizations, people, and companies that our news production covered. News can be about anything – it’s broad, but we also took into account there are certain areas where AP produces more content than others. We have verticals that have huge news coverage – this can be government, politics, sports, entertainment and emerging areas like health, environment, nature, and education. Looking at the content and knowing what the news is about helps us to develop the taxonomy framework. We took this content base and divided the entire news domain into smaller domains. Each person on the team was responsible for their three or four taxonomy domains. They became subject and theme matter experts.”

The value of Associated Press’s taxonomies comes from the entire content package, which includes photos, articles, and videos, all centered around descriptive metadata that makes the content agreeable and findable.

While the Associated Press is a non-profit news service, it does offer a platform called AP Metadata Services that is used by other news services. The Associated Press frequently updates its taxonomy with new terms when they enter the media. The AP taxonomy team works with the AP Editorial team to identify new terms and topics. The biggest challenges Zielinska faces are maintenance and writing in a manner that the natural language processing algorithms can understand.

As for the future, Zielinska fears news services losing their budgets, local news not getting as much coverage, and the spread of misinformation. The biggest problem is that automated technologies can take the misinformation and disseminate it. She advises, “Managers can help by creating standardized vocabularies for fact checking across media types, for example, so that deep fakes and other misleading media can be identified consistently across various outlets.”

Whitney Grace, April 1, 2020

Swagiggle? Nope, Not an April Fooler

April 1, 2020

Big ecommerce sites like eBay and Amazon depend on a robust, accurate, and functional search engine. Without a powerful search application, searching for items on eBay and Amazon is like looking through every page of a printed catalog. The only difference is that there are millions of items compared to the thousands in one catalog. Amazon and eBay are not always accurate, especially when users edit and add content without being monitored. That means there is room for improvement and for a startup to worm its way into the big leagues. Here is how Swagiggle describes itself:

“Swagiggle is a precision shopping search and product discovery website created by WAND, Inc. to demonstrate the capabilities of its taxonomy based product data organization and enrichment abilities featured in the WAND eCommerce Taxonomy Portal and PIM. WAND, Inc. is the world’s leading provider of pre-defined taxonomies, including the WAND Product and Service Taxonomy.

Have you ever had the experience of going to a category on an online retail site and seeing mis-categorized items? Or, a bunch of items dumped into a catch-all “Accessories” category. At Swagiggle, our goal is to provide accurate and specific categories so that our users can quickly find exactly the products they are looking for. From there, we assign product specifications so that users can filter through the items in a category and find exactly what they want.”

Wand’s Swagiggle sounds like an awesome product. Using products from its clients, Swagiggle offers an online catalog for users to search for products they wish to buy. These products range from clothing to cleaning products. The items are organized by large categories, then users can drill down to specific items or search with key words. It is a pretty standard search engine, but it has one major problem. The drilling down aspect does feel dated and half the time pictures and content would not load. The loading time is extraordinarily long too. Plus, due to the variety of their clients, items offered on Swagiggle are very random. Swagiggle needs to fix the broken pictures and figure out how to make itself faster.
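The approach Swagiggle describes, specific categories first and product specifications as filters second, can be sketched as follows. This is an illustration of the general pattern only, not WAND’s software. The catalog entries, category names, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical catalog: each product has one specific category plus a few specifications.
CATALOG = [
    {"name": "Rain Jacket", "category": "Clothing > Outerwear", "specs": {"color": "blue", "size": "M"}},
    {"name": "Wool Coat", "category": "Clothing > Outerwear", "specs": {"color": "gray", "size": "M"}},
    {"name": "Phone Case", "category": "Electronics > Phone Cases", "specs": {"color": "blue"}},
    {"name": "Glass Cleaner", "category": "Cleaning > Sprays", "specs": {"volume": "500 ml"}},
]


def filter_products(catalog, category, **required_specs):
    """Drill down: keep products in the category that match every requested specification."""
    return [
        item["name"]
        for item in catalog
        if item["category"] == category
        and all(item["specs"].get(key) == value for key, value in required_specs.items())
    ]


def facet_counts(catalog, category, spec_name):
    """Count the values of one specification within a category, for display as filter options."""
    return Counter(
        item["specs"][spec_name]
        for item in catalog
        if item["category"] == category and spec_name in item["specs"]
    )


print(filter_products(CATALOG, "Clothing > Outerwear", color="blue"))
print(dict(facet_counts(CATALOG, "Clothing > Outerwear", "size")))
```

Note that the blue phone case never shows up in the outerwear results. That is the whole argument for specific categories over a catch-all “Accessories” bucket.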

Whitney Grace, April 1, 2020

Intelligent Tagging Makes Unstructured Data Usable

March 20, 2020

We are not going to talk about indexing accuracy. Just keep that idea in mind, please.

Unstructured data is a nightmare nobody wants to handle. Within a giant unstructured mess, however, is usable information. How do you get to the golden information? There are multiple digital solutions, software applications, and big data tools that are supposed to get the job done. It raises another question: which tool do you choose? Among these choices is Intelligent Tagging from Refinitiv.

What is “intelligent tagging?”

“Intelligent Tagging uses natural language processing, text analytics and data-mining technologies to derive meaning from vast amounts of unstructured content. It’s the fastest, easiest and most accurate way to tag the people, places, facts and events in your data, and then assign financial topics and themes to increase your content’s value, accessibility and interoperability. Connecting your data consistently with Intelligent Tagging helps you to search smarter, personalize content recommendations and generate alpha.”

Intelligent Tagging can read through gigabytes of different textual information (emails, texts, notes, etc.) using natural language processing. The software structures data by assigning them tags, then forming connections from the content. After the information is organized, the search is empowered to quickly locate the desired information. Content can be organized in a variety of ways such as companies, people, location, topics, and more. Relevancy scores are added to determine how relevant a search indicator is to the search results. Intelligent Tagging also updates itself in real time by paying attention to the news and adding new metadata tags.
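To show what “assign tags, then score relevancy” can mean at its simplest, here is a minimal sketch. It is not Refinitiv’s method: a dictionary lookup stands in for natural language processing, and “relevance” is just each entity’s share of the mentions found. The gazetteer entries, the sample text, and the function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical gazetteer: phrase to look for -> (tag type, display label).
GAZETTEER = {
    "acme corp": ("company", "Acme Corp"),
    "jane doe": ("person", "Jane Doe"),
    "london": ("place", "London"),
}


def tag_text(text, gazetteer=GAZETTEER):
    """Return tags found in the text, each with a relevance score
    equal to its share of all entity mentions."""
    lowered = text.lower()
    counts = Counter()
    for phrase, (tag_type, label) in gazetteer.items():
        mentions = re.findall(r"\b" + re.escape(phrase) + r"\b", lowered)
        if mentions:
            counts[(tag_type, label)] = len(mentions)
    total = sum(counts.values())
    return [
        {"type": tag_type, "label": label, "relevance": round(count / total, 2)}
        for (tag_type, label), count in counts.most_common()
    ]


SAMPLE = "Acme Corp opened a London office. Jane Doe will run it for Acme Corp."
for tag in tag_text(SAMPLE):
    print(tag)
```

A dictionary like this tags only what it already knows, which is exactly where the indexing accuracy question we promised not to talk about comes back in.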

It is an optimized search experience and yields more powerful results in less time than similar software.

Intelligent Tagging offers a necessary service, but the only way to see if it promises to bring structure to data piles is to test it out.

Whitney Grace, March 20, 2020

IslandInText Reborn: TLDRThis

March 16, 2020

Many years ago (maybe 25+), we tested a desktop summarization tool called IslandInText. [#1 below] I believe, if my memory is working today, this was software developed in Australia by Island Software. There was a desktop version and a more robust system for large-scale summarizing of text. In the 1980s, there was quite a bit of interest in automatic summarization of text. Autonomy’s system could be configured to generate a précis if one was familiar with that system. Google’s basic citation is a modern version of what smart software can do to suggest what’s in a source item. No humans needed, of course. Too expensive and inefficient for the big folks I assume.

For many years, human abstract and indexing professionals were on staff. Our automated systems, despite their usefulness, could not handle nuances, special inclusions in source documents like graphs and tables, lists of entities which we processed with the controlled term MANYCOMPANIES, and other specialized functions. I would point out that most of today’s “modern” abstracting and indexing services are simply not as good as the original services like ABI / INFORM, Chemical Abstracts, Engineering Index, Predicasts, and other pioneers in the commercial database sector. (Anyone remember Ev Brenner? That’s what I thought, gentle reader. One does not have to bother oneself with the past in today’s mobile phone search expert world.)

For a number of years, I worked in the commercial database business. In order to speed the throughput of our citations to pharmaceutical, business, and other topic domains, machine text summarization was of interest to me and my colleagues.

A reader informed me that a new service is available. It is called TLDRThis. Here’s what the splash page looks like:


One can paste text or provide a url, and the system returns a synopsis of the source document. (The advanced service generates a more in-depth summary, but I did not test this. I am not too keen on signing up without knowing what the terms and conditions are.) There is a browser extension for the service. For this url, the system returned this summary:

Enterprise Search: The Floundering Fish!

Stephen E. Arnold Monitors Search,Content Processing,Text Mining,Related Topics His High-Tech Nerve Center In Rural Kentucky.,He Tries To Winnow The Goose Feathers The Giblets. He Works With Colleagues,Worldwide To Make This Web Log Useful To Those Who Want To Go,Beyond Search . Contact Him At Sa,At,Arnoldit.Com. His Web Site,With Additional Information About Search Is  |    Oct 27, 2011  |  Time Saved: 5 mins

  1. I am thinking about another monograph on the topic of “enterprise search.” The subject seems to be a bit like the motion picture protagonist Jason.
  2. The landscape of enterprise search is pretty much unchanged.
  3. But the technology of yesterday’s giants of enterprise search is pretty much unchanged.
  4. The reality is that the original Big Five had and still have technology rooted in the mid to late 1990s.

We noted several positive functions; for example, identifying the author and providing a synopsis of the source, even the goose feathers’ reference. On the downside, the system missed the main point of the article; that is, enterprise search has been a bit of a chimera for decades. Also, the system ignored the entities (company names) in the write up. These are important in my experience. People search for names, concepts, and events. The best synopses capture some of the entities and tell the reader to get the full list and other information from the source document. I am not sure what to make of the TLDRThis’ display of a picture which makes zero sense without the context of the full article. I fed the system a PDF which did not compute and I tried a link which generated a request to refresh the page, not the summary.

To get an “advanced summary”, one must sign up. I did not choose to do that. I have added this site to our “follow” list. I will make a note to try and find out who developed this service.

The pricing ranges from free for basic summarization to $60 per year for Bronze level service. Among its features are 100 summaries per month and “exclusive features”. These are coming soon. The top level service is $10 per month. The fee includes 300 summaries a month and “exclusive features.” These are also coming soon. The Platinum service is $20 per month and includes 1,000 summaries per month. These are “better” and will include forthcoming advanced features.
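For those who like to see the arithmetic, here is a quick calculation of the cost per summary for the paid tiers as described above. It assumes a subscriber uses the full monthly allotment every month for a year. The tier labels follow the text; the dictionary layout and function name are just for illustration.

```python
# Figures taken from the pricing described above; monthly prices are annualized.
TIERS = {
    "Bronze": {"dollars_per_year": 60, "summaries_per_month": 100},
    "top level": {"dollars_per_year": 10 * 12, "summaries_per_month": 300},
    "Platinum": {"dollars_per_year": 20 * 12, "summaries_per_month": 1000},
}


def cost_per_summary(tier):
    """Dollars per summary, assuming the full monthly allotment is used all year."""
    return tier["dollars_per_year"] / (tier["summaries_per_month"] * 12)


for name, tier in TIERS.items():
    print(f"{name}: ${cost_per_summary(tier):.3f} per summary")
```

On those assumptions, the per-summary cost drops as the allotment grows, from five cents at Bronze to two cents at Platinum.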

Stay tuned.

[#1] In the early 1990s, search and retrieval was starting to move from the esoteric world of commercial databases to desktop and UNIX machines. IslandSoft, founded in 1993, offered a search and retrieval system. My files from this time revealed that IslandSoft’s description of its system could be reused by today’s search and retrieval marketers. Here’s what IslandSoft said about InText:

IslandInTEXT is a document retrieval and management application for PCs and Unix workstations. IslandInTEXT’s powerful document analysis engine lets users quickly access documents through plain English queries, summarize large documents based on content rather than key words, and automatically route incoming text and documents to user-defined SmartFolders. IslandInTEXT offers the strongest solution yet to help organize and utilize information with large numbers of legacy documents residing on PCs, workstations, and servers as well as the proliferation of electronic mail documents and other data. IslandInTEXT supports a number of popular word processing formats including IslandWrite, Microsoft Word, and WordPerfect plus ASCII text.

IslandInTEXT Includes:

  • File cabinet/file folder metaphor.
  • HTML conversion.
  • Natural language queries for easily locating documents.
  • Relevancy ranking of query results.
  • Document summaries based on statistical relevance from 1 to 99% of the original document—create executive summaries of large documents instantly. [This means that the user can specify how detailed the summarization was; for example, a paragraph or a page or two.]
  • Summary Options. Summaries can be based on key word selection, key word ordering, key sentences, and many more.

[For example:] SmartFolder Routing. Directs incoming text and documents to user-defined folders. Hot Link Pointers. Allow documents to be viewed in their native format without creating copies of the original documents. Heuristic/Learning Architecture. Allows InTEXT to analyze documents according to the author’s style.
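The feature list mentions summaries sized from 1 to 99 percent of the original and based on statistical relevance. Here is a minimal sketch of one classic way to do that: score sentences by the frequency of their words and keep the requested share. I do not know how InTEXT actually worked, so the scoring rule, the stopword list, the sample text, and the function name are all assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "is", "against"}


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, percent=30):
    """Keep roughly `percent` of the sentences, chosen by average word frequency,
    and return them in their original order."""
    percent = max(1, min(99, percent))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(sentences) * percent / 100))
    best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(best))


SAMPLE = (
    "Enterprise search vendors sell search systems. "
    "Search systems index documents. "
    "The weather was pleasant. "
    "Users run search queries against the index."
)

print(summarize(SAMPLE, percent=50))
```

With the sample above, the off-topic sentence about the weather is the first to go. Note what this kind of scoring does not do: it has no notion of entities or of the author’s main point, which is the same gap noted in the TLDRThis test.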

A page for InText is still online as of today. The company appears to have ceased operations in 2010. Data in my files indicate that the name and possibly the code is owned by CP Software, but I have not verified this. I did not include InText in my first edition of Enterprise Search Report, which I wrote in 2003 and 2004. The company had fallen behind market leaders Autonomy, Endeca, and Fast Search & Transfer.

I am surprised at how many search and retrieval companies today are just traveling along well worn paths in the digital landscape. Does search work? Nope. That’s why there are people who specialize, remember things, and maintain personal files. Mobile device search means precision and recall are digital dodo birds in my opinion.

Stephen E Arnold, March 16, 2020

