No Fooling: Copyright Enforcer Does Indexing Too
April 1, 2020
The Associated Press is one of the oldest, most respected, and most widely read news services in the world. As more than half the world reads Associated Press content, one wonders how the news service organizes and distributes it. Synaptica has more details in the article, “Synaptica Insights: Veronika Zielinska, The Associated Press.”
Veronika Zielinska has a background in computational linguistics and natural language. She was interested in how automated tagging, taxonomies, and statistical engines apply rules to content. She joined Associated Press’s Information Management team in 2005, then moved up to the Metadata Technology team. Her current responsibilities are developing the Metadata Services platform, fine-tuning search quality and relevancy for content distribution platforms, scheme design, data transformations, analytics and business intelligence programs, and developing content enrichment methods.
Zielinska offers information on how the Associated Press builds a taxonomy:
“We looked at all the content that AP produced and scoped our taxonomy to cover all possible topics, events, places, organizations, people, and companies that our news production covered. News can be about anything – it’s broad, but we also took into account there are certain areas where AP produces more content than others. We have verticals that have huge news coverage – this can be government, politics, sports, entertainment and emerging areas like health, environment, nature, and education. Looking at the content and knowing what the news is about helps us to develop the taxonomy framework. We took this content base and divided the entire news domain into smaller domains. Each person on the team was responsible for their three or four taxonomy domains. They became subject and theme matter experts.”
The value of Associated Press’s taxonomies comes from the entire content package, which includes photos, articles, and videos centered around descriptive metadata that makes the content agreeable and findable.
While the Associated Press is a non-profit news service, it does offer a platform called AP Metadata Services that is used by other news services. The Associated Press frequently updates its taxonomy with new terms when they enter the media. The AP taxonomy team works with the AP Editorial team to identify new terms and topics. The biggest challenges Zielinska faces are maintenance and writing in a manner that natural language processing algorithms can understand.
As for the future, Zielinska fears news services losing their budgets, local news not getting as much coverage, and the spread of misinformation. The biggest problem is that automated technologies can take the misinformation and disseminate it. She advises, “Managers can help by creating standardized vocabularies for fact checking across media types, for example, so that deep fakes and other misleading media can be identified consistently across various outlets.”
Whitney Grace, April 1, 2020
Swagiggle? Nope, Not an April Fooler
April 1, 2020
Big ecommerce sites like eBay and Amazon depend on a robust, accurate, and functional search engine. Without a powerful search application, searching for items on eBay and Amazon is like looking through every page of a printed catalog. The only difference is that there are millions of items compared to the thousands in one catalog. Amazon and eBay are not always accurate, especially when users edit and add content without being monitored. That means there is room for improvement and for a startup to worm its way into the big leagues. Enter Swagiggle:
“Swagiggle is a precision shopping search and product discovery website created by WAND, Inc. to demonstrate the capabilities of its taxonomy based product data organization and enrichment abilities featured in the WAND eCommerce Taxonomy Portal and PIM. WAND, Inc. is the world’s leading provider of pre-defined taxonomies, including the WAND Product and Service Taxonomy.
Have you ever had the experience of going to a category on an online retail site and seeing mis-categorized items? Or, a bunch of items dumped into a catch-all “Accessories” category. At Swagiggle, our goal is to provide accurate and specific categories so that our users can quickly find exactly the products they are looking for. From there, we assign product specifications so that users can filter through the items in a category and find exactly what they want.”
WAND’s Swagiggle sounds like an awesome product. Using products from its clients, Swagiggle offers an online catalog for users to search for products they wish to buy. These products range from clothing to cleaning products. The items are organized by large categories, then users can drill down to specific items or search with key words. It is a pretty standard search engine, but it has one major problem. The drilling down aspect does feel dated, and half the time pictures and content would not load. The loading time is extraordinarily long too. Plus, due to the variety of its clients, items offered on Swagiggle are very random. Swagiggle needs to fix the broken pictures and figure out how to make itself faster.
Whitney Grace, April 1, 2020
Intelligent Tagging Makes Unstructured Data Usable
March 20, 2020
We are not going to talk about indexing accuracy. Just keep that idea in mind, please.
Unstructured data is a nightmare nobody wants to handle. Within a giant unstructured mess, however, is usable information. How do you get to the golden information? There are multiple digital solutions, software applications, and big data tools that are supposed to get the job done. It raises another question: which tool do you choose? Among these choices is Intelligent Tagging from Refinitiv.
What is “intelligent tagging?”
“Intelligent Tagging uses natural language processing, text analytics and data-mining technologies to derive meaning from vast amounts of unstructured content. It’s the fastest, easiest and most accurate way to tag the people, places, facts and events in your data, and then assign financial topics and themes to increase your content’s value, accessibility and interoperability. Connecting your data consistently with Intelligent Tagging helps you to search smarter, personalize content recommendations and generate alpha.”
Intelligent Tagging can read through gigabytes of different textual information (emails, texts, notes, etc.) using natural language processing. The software structures data by assigning them tags, then forming connections from the content. After the information is organized, the search is empowered to quickly locate the desired information. Content can be organized in a variety of ways such as companies, people, location, topics, and more. Relevancy scores are added to determine how relevant a search indicator is to the search results. Intelligent Tagging also updates itself in real time by paying attention to the news and adding new metadata tags.
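To make the tag-then-score idea concrete, here is a minimal Python sketch. It is not Refinitiv's method or API. The `VOCABULARY` table, the `tag_text` function, and the scoring rule (a term's share of all matched mentions) are assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: tag type -> known terms (illustrative only).
VOCABULARY = {
    "company": ["Refinitiv", "Amazon"],
    "person": ["Eric Schmidt"],
    "location": ["Kentucky", "San Jose"],
}

def tag_text(text):
    """Return (tag_type, term, relevance) tuples, most relevant first.

    Relevance is simply the term's share of all matched mentions,
    a stand-in for whatever scoring a commercial engine uses.
    """
    counts = Counter()
    for tag_type, terms in VOCABULARY.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if hits:
                counts[(tag_type, term)] = hits
    total = sum(counts.values())
    tags = [(t, term, round(n / total, 2)) for (t, term), n in counts.items()]
    return sorted(tags, key=lambda item: item[2], reverse=True)

sample = "Refinitiv sells tagging tools. Amazon competes, but Refinitiv focuses on finance."
tags = tag_text(sample)
```

A production system would rely on trained entity recognition rather than a fixed word list, but the output has the same shape: typed tags, each with a score a search engine can sort on.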
It is an optimized search experience and yields more powerful results in less time than similar software.
Intelligent Tagging offers a necessary service, but the only way to see if it delivers on its promise to bring structure to data piles is to test it out.
Whitney Grace, March 20, 2020
IslandInText Reborn: TLDRThis
March 16, 2020
Many years ago (maybe 25+), we tested a desktop summarization tool called IslandInText. [#1 below] I believe, if my memory is working today, this was software developed in Australia by Island Software. There was a desktop version and a more robust system for large-scale summarizing of text. In the 1980s, there was quite a bit of interest in automatic summarization of text. Autonomy’s system could be configured to generate a précis if one was familiar with that system. Google’s basic citation is a modern version of what smart software can do to suggest what’s in a source item. No humans needed, of course. Too expensive and inefficient for the big folks I assume.
For many years, human abstract and indexing professionals were on staff. Our automated systems, despite their usefulness, could not handle nuances, special inclusions in source documents like graphs and tables, lists of entities which we processed with the controlled term MANYCOMPANIES, and other specialized functions. I would point out that most of today’s “modern” abstracting and indexing services are simply not as good as the original services like ABI / INFORM, Chemical Abstracts, Engineering Index, Predicasts, and other pioneers in the commercial database sector. (Anyone remember Ev Brenner? That’s what I thought, gentle reader. One does not have to bother oneself with the past in today’s mobile phone search expert world.)
For a number of years, I worked in the commercial database business. Because we wanted to speed the throughput of our citations in pharmaceutical, business, and other topic domains, machine text summarization was of interest to me and my colleagues.
A reader informed me that a new service is available. It is called TLDRThis. Here’s what the splash page looks like:
One can paste text or provide a url, and the system returns a synopsis of the source document. (The advanced service generates a more in-depth summary, but I did not test this. I am not too keen on signing up without knowing what the terms and conditions are.) There is a browser extension for the service. For this url, the system returned this summary:
Enterprise Search: The Floundering Fish!
Stephen E. Arnold Monitors Search,Content Processing,Text Mining,Related Topics His High-Tech Nerve Center In Rural Kentucky.,He Tries To Winnow The Goose Feathers The Giblets. He Works With Colleagues,Worldwide To Make This Web Log Useful To Those Who Want To Go,Beyond Search . Contact Him At Sa,At,Arnoldit.Com. His Web Site,With Additional Information About Search Is | Oct 27, 2011 | Time Saved: 5 mins
- I am thinking about another monograph on the topic of “enterprise search.” The subject seems to be a bit like the motion picture protagonist Jason.
- The landscape of enterprise search is pretty much unchanged.
- But the technology of yesterday’s giants of enterprise search is pretty much unchanged.
- The reality is that the original Big Five had and still have technology rooted in the mid to late 1990s.
We noted several positive functions; for example, identifying the author and providing a synopsis of the source, even the goose feathers’ reference. On the downside, the system missed the main point of the article; that is, enterprise search has been a bit of a chimera for decades. Also, the system ignored the entities (company names) in the write up. These are important in my experience. People search for names, concepts, and events. The best synopses capture some of the entities and tell the reader to get the full list and other information from the source document. I am not sure what to make of the TLDRThis’ display of a picture which makes zero sense without the context of the full article. I fed the system a PDF which did not compute and I tried a bit.ly link which generated a request to refresh the page, not the summary.
To get an “advanced summary”, one must sign up. I did not choose to do that. I have added this site to our “follow” list. I will make a note to try and find out who developed this service.
The pricing ranges from free for basic summarization to $60 per year for Bronze level service. Among its features are 100 summaries per month and “exclusive features”. These are coming soon. The next level service is $10 per month. The fee includes 300 summaries a month and “exclusive features.” These are also coming soon. The Platinum service is $20 per month and includes 1,000 summaries per month. These are “better” and will include forthcoming advanced features.
Stay tuned.
[#1] In the early 1990s, search and retrieval was starting to move from the esoteric world of commercial databases to desktop and UNIX machines. IslandSoft, founded in 1993, offered a search and retrieval system. My files from this time revealed that IslandSoft’s description of its system could be reused by today’s search and retrieval marketers. Here’s what IslandSoft said about InText:
IslandInTEXT is a document retrieval and management application for PCs and Unix workstations. IslandInTEXT’s powerful document analysis engine lets users quickly access documents through plain English queries, summarize large documents based on content rather than key words, and automatically route incoming text and documents to user-defined SmartFolders. IslandInTEXT offers the strongest solution yet to help organize and utilize information with large numbers of legacy documents residing on PCs, workstations, and servers as well as the proliferation of electronic mail documents and other data. IslandInTEXT supports a number of popular word processing formats including IslandWrite, Microsoft Word, and WordPerfect plus ASCII text.
IslandInTEXT Includes:
- File cabinet/file folder metaphor.
- HTML conversion.
- Natural language queries for easily locating documents.
- Relevancy ranking of query results.
- Document summaries based on statistical relevance from 1 to 99% of the original document—create executive summaries of large documents instantly. [This means that the user can specify how detailed the summarization was; for example, a paragraph or a page or two.]
- Summary Options. Summaries can be based on key word selection, key word ordering, key sentences, and many more.
- SmartFolder Routing. Directs incoming text and documents to user-defined folders.
- Hot Link Pointers. Allow documents to be viewed in their native format without creating copies of the original documents.
- Heuristic/Learning Architecture. Allows InTEXT to analyze documents according to the author’s style.
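The "1 to 99% of the original document" option in the feature list can be illustrated with a short Python sketch of frequency-based extractive summarization. This is not IslandInTEXT's algorithm, which my files do not describe. The `summarize` function, the `STOPWORDS` set, and the scoring rule (average word frequency per sentence) are assumptions for illustration only.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not from any product).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "that"}

def _tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, percent=25):
    """Keep roughly `percent` (1-99) of the sentences, chosen by score,
    and return them in their original order."""
    if not 1 <= percent <= 99:
        raise ValueError("percent must be between 1 and 99")
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_tokens(text))

    def score(sentence):
        tokens = _tokens(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    keep = max(1, round(len(sentences) * percent / 100))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:keep]))

demo = "Search is hard. Enterprise search vendors promise search that works. Lunch was fine."
short = summarize(demo, 34)
```

Raising `percent` yields a longer, more detailed summary; lowering it yields something closer to a one-line abstract. Note that a scorer like this knows nothing about entities, which is the gap flagged in the TLDRThis test above.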
A page for InText is still online as of today at http://www.intext.com/. The company appears to have ceased operations in 2010. Data in my files indicate that the name and possibly the code are owned by CP Software, but I have not verified this. I did not include InText in my first edition of Enterprise Search Report, which I wrote in 2003 and 2004. The company had fallen behind market leaders Autonomy, Endeca, and Fast Search & Transfer.
I am surprised at how many search and retrieval companies today are just traveling along well worn paths in the digital landscape. Does search work? Nope. That’s why there are people who specialize, remember things, and maintain personal files. Mobile device search means precision and recall are digital dodo birds in my opinion.
Stephen E Arnold, March 16, 2020
Adobe PDF: Maybe as Interesting As Flash?
March 4, 2020
Adobe Portable Document Format files flashed on DarkCyber’s radar in the mid 1980s. Adobe pitched the virtues of PDF to big publishing companies. And Stephen E Arnold worked at such an organization at this time. I was given the job of examining the early version of PDF referenced by the code name Trapeze.
Trapeze artists fall to their death. Adobe Acrobat pulled off a spectacular trick, survived, became sort of open, and now seems to be a permanent part of the landscape decorated with the dumpsters burning Microsoft XPS Document Writer files.
A very good write up about the problems with PDF files is FilingDB’s “What’s So Hard about PDF Text Extraction?” The information in this write up makes explicit why PDFs are not easy to manipulate, analyze, and mine.
The write up provides the data needed to understand that when a vendor says, “We process the hidden content in PDF files”, those vendors do not explain how much and what is omitted, ignored, and unindexed.
People believe that specifying a filetype: command to Bing or Google delivers comprehensive content from PDF files. No way, sad to say. The same problem exists for any search or content processing vendor’s connectors for PDF files.
This is important when one is conducting mission critical data analysis, certain investigations, and other types of work in which “zero error” is the goal. Will the problem be remediated? Maybe, but I spotted it in the 1980s, and it persists today.
Stephen E Arnold, March 4, 2020
TemaTres: Open Source Indexing Tool Updated
February 11, 2020
Open source software is the foundation for many proprietary software startups, including the open source developers themselves. Most open source software tends to lag in the matter of updates and patches, but TemaTres recently updated, according to the blog post “TemaTres 3.1 Release Is Out! Open Source Web Tool To Manage Controlled Vocabularies.”
TemaTres is an open source vocabulary server designed to manage controlled vocabularies, taxonomies, and thesauri. The recent update includes the following:
“Utility for importing vocabularies encoded in MARC-XML format
- Utility for the mass export of vocabulary in MARC-XML format
- New reports about global vocabulary structure (ex: https://r020.com.ar/tematres/demo/sobre.php?setLang=en#global_view)
- Distribution of terms according to depth level
- Distribution of sum of preferred terms and the sum of alternative terms
- Distribution of sum of hierarchical relationships and sum of associative relationships
- Report about terms with relevant degree of centrality in the vocabulary (according to prototypical conditions)
- Presentation of terms with relevant degree of centrality in each facet
- New options to config the presentation of notes: define specific types of note as prominent (the others note types will be presented in collapsed div).
- Button for Copy to clipboard the terms with indexing value (Copy-one-click button)
- New user login scheme (login)
- Allows to config and add Google Analytics tracking code (parameter in config.tematres.php file)
- Improvements in standard exposure of metadata tags
- Inclusion of the term notation or code in the search box predictive text
- Compatibility with PHP 7.2”
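For readers wondering what a "distribution of terms according to depth level" report involves, here is a minimal Python sketch. It is not TemaTres code (TemaTres is a PHP application). The `BROADER` mapping and both function names are hypothetical, and top terms are assumed to sit at level 1.

```python
from collections import Counter

# Hypothetical vocabulary: term -> broader (parent) term; None marks a top term.
BROADER = {
    "Science": None,
    "Physics": "Science",
    "Optics": "Physics",
    "Chemistry": "Science",
    "Arts": None,
}

def depth(term, broader):
    """Count hierarchical levels from `term` up to its top term (top term = level 1)."""
    level = 1
    seen = {term}
    while broader[term] is not None:
        term = broader[term]
        if term in seen:
            # A loop in broader-term links would otherwise never terminate.
            raise ValueError("cycle in hierarchy")
        seen.add(term)
        level += 1
    return level

def depth_distribution(broader):
    """Map each depth level to the number of terms found at that level."""
    counts = Counter(depth(term, broader) for term in broader)
    return dict(sorted(counts.items()))

distribution = depth_distribution(BROADER)
```

A lopsided distribution, for example most terms at level 1, is a quick signal that a vocabulary is more of a flat list than a hierarchy.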
TemaTres does updates frequently, but it is monitored. The main ethos about open source is to give back as much as you take. TemaTres appears to follow this modus operandi. If TemaTres wants to promote its web image, the organization should really upgrade its Web site, fix the broken links, and provide more information on what the software actually does.
Whitney Grace, February 11, 2020
Ontotext: GraphDB Update Arrives
January 31, 2020
Semantic knowledge firm Ontotext has put out an update to its graph database, The Register announces in, “It’s Just Semantics: Bulgarian Software Dev Ontotext Squeezes Out GraphDB 9.1.” Some believe graph databases are The Answer to a persistent issue. The article explains:
“The aim of applying graph database technology to enterprise data is to try to overcome the age-old problem of accessing latent organizational knowledge; something knowledge management software once tried to address. It’s a growing thing: Industry analyst Gartner said in November the application of graph databases will ‘grow at 100 per cent annually over the next few years’. GraphDB is ranked at eighth position on DB-Engines’ list of most popular graph DBMS, where it rubs shoulders with the likes of tech giants such as Microsoft, with its Azure Cosmos DB, and Amazon’s Neptune. ‘GraphDB is very good at text analytics because any natural language is very ambiguous: a project name could be a common English word, for example. But when you understand the context and how entities are connected, you can use these graph models to disambiguate the meaning,’ [GraphDB product manager Vassil] Momtchev said.”
The primary feature of this update is support for the Shapes Constraint Language, or SHACL, which the World Wide Web Consortium recommends for validating data graphs against a set of conditions. This support lets the application validate data against the schema whenever new data is loaded to the database instead of having to manually run queries to check. A second enhancement allows users to track changes in current or past database transactions. Finally, the database now supports network authentication protocol Kerberos, eliminating the need to store passwords on client computers.
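The validate-on-load idea can be sketched in a few lines of plain Python. To be clear, this is not SHACL syntax and not GraphDB's API. `PROJECT_SHAPE` and `ValidatingStore` are hypothetical names, and the "shape" is reduced to required properties and their types.

```python
# Hypothetical "shape": required properties and expected types for one kind of record.
PROJECT_SHAPE = {"name": str, "budget": int}

class ValidatingStore:
    """Toy store that checks each record against a shape before accepting it."""

    def __init__(self, shape):
        self.shape = shape
        self.records = []

    def violations(self, record):
        """List every way `record` fails the shape (an empty list means valid)."""
        problems = []
        for prop, expected in self.shape.items():
            if prop not in record:
                problems.append(f"missing property: {prop}")
            elif not isinstance(record[prop], expected):
                problems.append(f"{prop} should be {expected.__name__}")
        return problems

    def load(self, record):
        """Reject invalid data at load time instead of finding it later by query."""
        problems = self.violations(record)
        if problems:
            raise ValueError("; ".join(problems))
        self.records.append(record)

store = ValidatingStore(PROJECT_SHAPE)
store.load({"name": "Apollo", "budget": 100})
```

The point of the design is the placement of the check: bad data never enters the store, so no one has to remember to run a cleanup query afterward.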
Cynthia Murrell, January 31, 2020
Former Amazonian Suggests the Pre Built Models Are Pipe Dreams
January 30, 2020
I read a PR-infused write up with some interesting presumably accurate information. The article is from ZDNet.com (an outfit somewhat removed from Mr. Ziff’s executive dining room.) Its title? “Reality Engines Offers a Deep Learning Tour de Force to Challenge Amazon et al in Enterprise AI”. Here’s a passage which warranted an Amazon orange highlighter circle:
The goal, Reddy told ZDNet, is a service that “automatically creates production-ready models from data in the wild,” to ease the labor of corporations that don’t have massive teams of data scientists and deep learning programmers. “While other companies talk about offering this service, it is still largely a pipe-dream,” wrote Reddy in an email exchange with ZDNet. “We have made significant strides towards this goal,” she said.
Who will care about this assertion? Since the founder of the company is a former top dog of “AI verticals” at Amazon’s AWS cloud service, Amazon may care. Amazon asserts that SageMaker and related tools make machine learning easier, faster, better (cheaper may depend on one’s point of view). A positive summary of some of Amazon’s machine learning capabilities appears in “Building Fully Custom Machine Learning Models on AWS SageMaker: A Practical Guide.”
Because the sweeping generalization about “pipe dreams” includes most of the machine learning honchos and honchettes, Facebook, Google, IBM, and others are probably going to pay attention. After all, Reality Engines has achieved “significant strides” with 18 people, some advisers, and money from Google’s former adult, Eric Schmidt, who invested $5.25 million.
The write up provides a glimpse of some of the ingredients in the Reality Engines’ secret sauce:
… The two pillars of the offering are “generative adversarial networks,” known as “GANs,” and “network architecture search.” Those two technologies can dramatically reduce the effort needed to build machine learning for enterprise functions, the company contends. GANs, of course, are famous for making fake faces by optimizing a competition between two neural networks based on the encoding and decoding of real images. In this case, Reality Engines has built something called a “DAGAN,” a GAN that can be used for data augmentation, the practice of making synthetic data sets when not enough data is available to train a neural network in a given domain. DAGANs were pioneered by Antreas Antoniou of the Institute for Adaptive and Neural Computation at the University of Edinburgh in 2018. The Reality Engines team has gone one better: They built a DAGAN by using network architecture search, or “NAS,” in which the computer finds the best architecture for the GAN by trying various combinations of “cells,” basic primitives composed of neural network modules.
For those not able to visualize a GAN and DAGAN system, the write up includes an allegedly accurate representation of some of the Reality Engines’ components. The diagram in the write up is for another system, and authored in part by a wizard working at another firm, but let’s assume we are in the ballpark conceptually:
It appears that there is a training set. The data are fed to a DenseNet classifier and a validator. Then the DAGAN generator kicks in and processes data piped from the data sources. What’s interesting is that there are two process blocks (maybe Bayesian at its core with the good old Gaussian stuff mixed in) which “discriminate”. DarkCyber thinks this means that the system tries to reduce its margin of error for metatagging and other operations. The “Real Synthetic” block may be an error checking component, but the recipe is incomplete.
The approach is a mash up: Reality Engines’ code with software called “Bananas,” presumably developed by the company Petuum and possibly experts at the University of Toronto.
How accurate is the system? DarkCyber typically ignores vendors’ assertions about accuracy. You can make up your own mind about this statement:
“The NAS-improved DAGAN improves classification accuracy on the target dataset by as much as 20.5% and can transfer between tasks,” they write.
The “reality” of most machine learning systems is that accuracy of 85 percent is attainable under quite specific conditions: Content from a bounded domain, careful construction of training data, calibration, and on-going retraining when what DarkCyber calls Bayesian drift kicks in. If a system is turned on and just used, accuracy degrades over time. At some point, the outputs are sufficiently wide of the mark that a ZDNet journalist may spot problems.
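One way to notice drift before a journalist does is to keep scoring a sample of outputs against human judgments. The Python sketch below shows the idea. The `AccuracyMonitor` class and its window size are assumptions for illustration, and the 0.85 threshold simply echoes the 85 percent figure above.

```python
from collections import deque

class AccuracyMonitor:
    """Track accuracy over the most recent `window` human-checked predictions."""

    def __init__(self, window=100, threshold=0.85):
        # deque(maxlen=...) silently drops the oldest result once the window is full.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted, actual):
        self.results.append(predicted == actual)

    def accuracy(self):
        """Fraction of recent predictions that were correct, or None if no data yet."""
        return sum(self.results) / len(self.results) if self.results else None

    def needs_retraining(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold

monitor = AccuracyMonitor(window=4, threshold=0.85)
checked = [("sports", "sports"), ("politics", "politics"),
           ("health", "sports"), ("sports", "politics")]
for predicted, actual in checked:
    monitor.record(predicted, actual)
```

The expensive part is not the code. It is paying humans to supply the `actual` labels on an ongoing basis, which is exactly the cost that "turn it on and just use it" deployments skip.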
What does the system output? It seems to DarkCyber that the information in the write up focuses on classifiers. If our interpretation is narrowed to that function, content is dumped into buckets. These buckets make it easy to extract content and perform additional analysis. If each step in a work flow works, the final outputs will have a greater likelihood of being “accurate” or “right.” But there are many slips between the cup and the lip as a famous plagiarizer once repeated.
What type of data can the system process? The answer is structured data, presumably cleansed and validated data.
If the Reality Engines’ approach is of interest, the company’s Web site offers a “Request Access” button. Click it and you are probably free to test the system or kick its off road tires.
Will bananas and backpropagation be on your machine learning menu in the future?
Stephen E Arnold, January 30, 2020
Library Software Soutron Version 4.1.4 Now Available
January 17, 2020
Library automation and cataloging firm Soutron introduces its “Latest Software Update—Soutron Version 4.1.4.” The announcement describes the updates and features, complete with screenshots. The introduction reveals:
“This update provides an eagerly awaited new ‘Collections’ feature, refinements to both the Search Portal, updates to the new Admin pages and further language support. Details can be found below. These latest updates are the results of our agile development process and by working closely with, and listening to, our clients’ needs. The results are an industry leading world class library, archive and information management solution.”
Regarding that shiny new Collections feature, part of the Search Portal, we learn:
“This feature empowers users to select records from within Search Results and to assign them to a ‘Collection’. A user who is logged in may create their own Collection, adding and removing items as needed. A Collection can be easily managed, shared and organized in a tree view as shown below. This makes it easy for users, researchers or lawyers to quickly reference items of use that have been found, creating their own ‘Bento Box’ of records and materials, avoiding the need to keep performing searches or looking through saved searches for multiple relevant records.”
That does sound helpful. Other upgrades include enhanced organization for saved searches, improved viewing on mobile devices, easier search-template management, and the addition of a Default Availability status configuration option. See the write-up for more details.
Based in Derby, United Kingdom, Soutron has been creating library management systems for corporations and other organizations since 1989. The company continues to flourish by proudly embracing technological advances like automation and cloud-based systems.
Cynthia Murrell, January 16, 2020
A Taxonomy Vendor: Still Chugging Along
January 15, 2020
Semaphore Version 5 from Smartlogic coming soon.
An indexing software company now morphed into a semantic AI outfit, Smartlogic promises Version 5 of its enterprise platform, Semaphore, will be available any time now.
The company modestly presents the announcement below the virtual fold in the company newsletter, “The Semaphore—Smartlogic’s Quarterly Newsletter—December 2019.” The General Access release should be out by the end of January. We’re succinctly informed because in indexing succinct is good:
“Semaphore 5 embodies innovative technologies and strategies to deliver a unified user experience, enhanced interoperability, and flexible integration:
*A single platform experience – modules are tightly integrated.
*Intuitive and simplified installation and administration – software can be download and configured with minimal clicks. An updated landing page allows you to quickly navigate modules and monitor status.
*Improved coupling of classification and language services, as well as improved performance.
*Updated the linguistic model and fact extraction capabilities.
*New – Document Semantic Analyzer – a performant content analyzer that provides detailed classification and language services results.
*New branding that aligns modules with capabilities and functionality.
“Semaphore 5 continues to focus around 3 core areas – Model & collaborate; fact extraction, auto-classification & language services; and integrate & visualize – in a modular platform that allows you to add capabilities as your business needs evolve. As you upgrade to Semaphore 5, you will be able to take advantage of the additional components and capabilities incorporated in your licensed modules.”
Semaphore is available on-premise, in the cloud, or a combination. Smartlogic (not to be confused with the custom app company Smartlogic) was founded in 2006 and is based in San Jose, California. The company owns SchemaLogic. Yep, we’re excited too. Maybe NLP, predictive analytics, and quantum computing technology will make a debut in this release. If not in software, perhaps in the marketing collateral?
Cynthia Murrell, January 15, 2020