IslandInText Reborn: TLDRThis

March 16, 2020

Many years ago (maybe 25+), we tested a desktop summarization tool called IslandInText. [#1 below] I believe, if my memory is working today, this was software developed in Australia by Island Software. There was a desktop version and a more robust system for large-scale summarizing of text. In the 1980s, there was quite a bit of interest in automatic summarization of text. Autonomy’s system could be configured to generate a précis if one was familiar with that system. Google’s basic citation is a modern version of what smart software can do to suggest what’s in a source item. No humans needed, of course. Too expensive and inefficient for the big folks I assume.

For many years, human abstract and indexing professionals were on staff. Our automated systems, despite their usefulness, could not handle nuances, special inclusions in source documents like graphs and tables, lists of entities which we processed with the controlled term MANYCOMPANIES, and other specialized functions. I would point out that most of today’s “modern” abstracting and indexing services are simply not as good as the original services like ABI / INFORM, Chemical Abstracts, Engineering Index, Predicasts, and other pioneers in the commercial database sector. (Anyone remember Ev Brenner? That’s what I thought, gentle reader. One does not have to bother oneself with the past in today’s mobile phone search expert world.)

For a number of years, I worked in the commercial database business. Because we wanted to speed the throughput of our citations for pharmaceutical, business, and other topic domains, machine text summarization was of interest to me and my colleagues.

A reader informed me that a new service is available. It is called TLDRThis. Here’s what the splash page looks like:

image

One can paste text or provide a URL, and the system returns a synopsis of the source document. (The advanced service generates a more in-depth summary, but I did not test this. I am not too keen on signing up without knowing what the terms and conditions are.) There is a browser extension for the service. For this URL, the system returned this summary:

Enterprise Search: The Floundering Fish!

Stephen E. Arnold Monitors Search,Content Processing,Text Mining,Related Topics His High-Tech Nerve Center In Rural Kentucky.,He Tries To Winnow The Goose Feathers The Giblets. He Works With Colleagues,Worldwide To Make This Web Log Useful To Those Who Want To Go,Beyond Search . Contact Him At Sa,At,Arnoldit.Com. His Web Site,With Additional Information About Search Is  |    Oct 27, 2011  |  Time Saved: 5 mins

  1. I am thinking about another monograph on the topic of “enterprise search.” The subject seems to be a bit like the motion picture protagonist Jason.
  2. The landscape of enterprise search is pretty much unchanged.
  3. But the technology of yesterday’s giants of enterprise search is pretty much unchanged.
  4. The reality is that the original Big Five had and still have technology rooted in the mid to late 1990s.

We noted several positive functions; for example, identifying the author and providing a synopsis of the source, even the goose feathers’ reference. On the downside, the system missed the main point of the article; that is, enterprise search has been a bit of a chimera for decades. Also, the system ignored the entities (company names) in the write up. These are important in my experience. People search for names, concepts, and events. The best synopses capture some of the entities and tell the reader to get the full list and other information from the source document. I am not sure what to make of the TLDRThis’ display of a picture which makes zero sense without the context of the full article. I fed the system a PDF which did not compute and I tried a bit.ly link which generated a request to refresh the page, not the summary.

To get an “advanced summary”, one must sign up. I did not choose to do that. I have added this site to our “follow” list. I will make a note to try and find out who developed this service.

The pricing ranges from free for basic summarization to $60 per year for Bronze level service. Among its features are 100 summaries per month and “exclusive features”. These are coming soon. The top level service is $10 per month. The fee includes 300 summaries a month and “exclusive features.” These are also coming soon. The Platinum service is $20 per month and includes 1,000 summaries per month. These are “better” and will include forthcoming advanced features.
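For readers who want to compare the tiers, here is a minimal sketch of the cost per summary at full quota. The fees and quotas come from the paragraph above; the function name and the plan labels are my own, and the Bronze figure assumes the $60 annual fee is spread evenly over 12 months.

```python
def cost_per_summary(monthly_fee_usd: float, summaries_per_month: int) -> float:
    """Cost of one summary if the whole monthly quota is used."""
    return monthly_fee_usd / summaries_per_month


# (monthly fee in USD, summaries per month), as stated in the write up.
plans = {
    "Bronze ($60 per year)": (60 / 12, 100),
    "$10 per month tier": (10, 300),
    "Platinum ($20 per month)": (20, 1000),
}

for name, (fee, quota) in plans.items():
    print(f"{name}: ${cost_per_summary(fee, quota):.3f} per summary")
```

The per-summary cost drops as the tier rises, provided one actually uses the quota.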

Stay tuned.

[#1] In the early 1990s, search and retrieval was starting to move from the esoteric world of commercial databases to desktop and UNIX machines. IslandSoft, founded in 1993, offered a search and retrieval system. My files from this time revealed that IslandSoft’s description of its system could be reused by today’s search and retrieval marketers. Here’s what IslandSoft said about InText:

IslandInTEXT is a document retrieval and management application for PCs and Unix workstations. IslandInTEXT’s powerful document analysis engine lets users quickly access documents through plain English queries, summarize large documents based on content rather than key words, and automatically route incoming text and documents to user-defined SmartFolders. IslandInTEXT offers the strongest solution yet to help organize and utilize information with large numbers of legacy documents residing on PCs, workstations, and servers as well as the proliferation of electronic mail documents and other data. IslandInTEXT supports a number of popular word processing formats including IslandWrite, Microsoft Word, and WordPerfect plus ASCII text.

IslandInTEXT Includes:

  • File cabinet/file folder metaphor.
  • HTML conversion.
  • Natural language queries for easily locating documents.
  • Relevancy ranking of query results.
  • Document summaries based on statistical relevance from 1 to 99% of the original document—create executive summaries of large documents instantly. [This means that the user can specify how detailed the summarization was; for example, a paragraph or a page or two.]
  • Summary Options. Summaries can be based on key word selection, key word ordering, key sentences, and many more.

[For example:] SmartFolder Routing. Directs incoming text and documents to user-defined folders. Hot Link Pointers. Allow documents to be viewed in their native format without creating copies of the original documents. Heuristic/Learning Architecture. Allows InTEXT to analyze documents according to the author’s style.
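To make the “1 to 99% of the original document” idea concrete, here is a minimal sketch of one common statistical approach: score each sentence by how frequent its words are in the whole document, then keep the top-scoring share. This is an assumption about the general technique, not IslandInTEXT’s actual algorithm, which the description does not disclose. The function name, the stopword list, and the scoring rule are all illustrative.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "by"}


def _tokens(sentence: str) -> list:
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text: str, percent: int) -> str:
    """Keep roughly `percent` (1 to 99) of the sentences, in original order."""
    if not 1 <= percent <= 99:
        raise ValueError("percent must be between 1 and 99")
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    if not sentences:
        return ""

    # Document-wide word frequencies drive the sentence scores.
    freq = Counter(w for s in sentences for w in _tokens(s))

    def score(sentence: str) -> float:
        words = _tokens(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    keep = max(1, round(len(sentences) * percent / 100))
    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]),
                  reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(best))
```

Calling `summarize(document_text, 10)` would return about a tenth of the sentences. Note that a frequency-only scorer like this one has the weakness I flagged for TLDRThis: nothing in it gives special weight to entities such as company names.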

A page for InText is still online as of today at http://www.intext.com/. The company appears to have ceased operations in 2010. Data in my files indicate that the name and possibly the code are owned by CP Software, but I have not verified this. I did not include InText in my first edition of Enterprise Search Report, which I wrote in 2003 and 2004. The company had fallen behind market leaders Autonomy, Endeca, and Fast Search & Transfer.

I am surprised at how many search and retrieval companies today are just traveling along well worn paths in the digital landscape. Does search work? Nope. That’s why there are people who specialize, remember things, and maintain personal files. Mobile device search means precision and recall are digital dodo birds in my opinion.

Stephen E Arnold, March 16, 2020

 

Adobe PDF: Maybe as Interesting As Flash?

March 4, 2020

Adobe Portable Document Format files flashed on DarkCyber’s radar in the mid 1980s. Adobe pitched the virtues of PDF to big publishing companies. And Stephen E Arnold worked at such an organization at this time. I was given the job of examining the early version of PDF referenced by the code name Trapeze.

Trapeze artists fall to their death. Adobe Acrobat pulled off a spectacular trick, survived, became sort of open, and now seems to be a permanent part of the landscape decorated with the dumpsters burning Microsoft XPS Document Writer files.

A very good write up about the problems with PDF files is FilingDB’s “What’s So Hard about PDF Text Extraction?” The information in this write up makes explicit why PDFs are not easy to manipulate, analyze, and mine.

The write up provides the data needed to understand that when vendors say, “We process the hidden content in PDF files”, those vendors do not explain how much and what is omitted, ignored, and unindexed.

People believe that giving a filetype: command to Bing or Google delivers comprehensive content from PDF files. No way, sad to say. The same problem exists for any search or content processing vendor’s connectors for PDF files.

This is important when one is conducting mission critical data analysis, certain investigations, and other types of work in which “zero error” is the goal. Will the problem be remediated? Maybe, but I spotted it in the 1980s, and it persists today.

Stephen E Arnold, March 4, 2020

TemaTres: Open Source Indexing Tool Updated

February 11, 2020

Open source software is the foundation for many proprietary software startups, including the open source developers themselves. Most open source software tends to lag in the matter of updates and patches, but TemaTres was recently updated, according to the blog post “TemaTres 3.1 Release Is Out! Open Source Web Tool To Manage Controlled Vocabularies.”

TemaTres is an open source vocabulary server designed to manage controlled vocabularies, taxonomies, and thesauri. The recent update includes the following:

  • “Utility for importing vocabularies encoded in MARC-XML format
  • Utility for the mass export of vocabulary in MARC-XML format
  • New reports about global vocabulary structure (ex: https://r020.com.ar/tematres/demo/sobre.php?setLang=en#global_view)
  • Distribution of terms according to depth level
  • Distribution of sum of preferred terms and the sum of alternative terms
  • Distribution of sum of hierarchical relationships and sum of associative relationships
  • Report about terms with relevant degree of centrality in the vocabulary (according to prototypical conditions)
  • Presentation of terms with relevant degree of centrality in each facet
  • New options to config the presentation of notes: define specific types of note as prominent (the others note types will be presented in collapsed div).
  • Button for Copy to clipboard the terms with indexing value (Copy-one-click button)
  • New user login scheme (login)
  • Allows to config and add Google Analytics tracking code (parameter in config.tematres.php file)
  • Improvements in standard exposure of metadata tags
  • Inclusion of the term notation or code in the search box predictive text
  • Compatibility with PHP 7.2”

TemaTres does updates frequently, but it is monitored. The main ethos about open source is to give back as much as you take. TemaTres appears to follow this modus operandi. If TemaTres wants to promote its web image, the organization should really upgrade its Web site, fix the broken links, and provide more information on what the software actually does.

Whitney Grace, February 11, 2020

Ontotext: GraphDB Update Arrives

January 31, 2020

Semantic knowledge firm Ontotext has put out an update to its graph database, The Register announces in, “It’s Just Semantics: Bulgarian Software Dev Ontotext Squeezes Out GraphDB 9.1.” Some believe graph databases are The Answer to a persistent issue. The article explains:

“The aim of applying graph database technology to enterprise data is to try to overcome the age-old problem of accessing latent organizational knowledge; something knowledge management software once tried to address. It’s a growing thing: Industry analyst Gartner said in November the application of graph databases will ‘grow at 100 per cent annually over the next few years’. GraphDB is ranked at eighth position on DB-Engines’ list of most popular graph DBMS, where it rubs shoulders with the likes of tech giants such as Microsoft, with its Azure Cosmos DB, and Amazon’s Neptune. ‘GraphDB is very good at text analytics because any natural language is very ambiguous: a project name could be a common English word, for example. But when you understand the context and how entities are connected, you can use these graph models to disambiguate the meaning,’ [GraphDB product manager Vassil] Momtchev said.”

The primary feature of this update is support for the Shapes Constraint Language, or SHACL, which the World Wide Web Consortium recommends for validating data graphs against a set of conditions. This support lets the application validate data against the schema whenever new data is loaded to the database instead of having to manually run queries to check. A second enhancement allows users to track changes in current or past database transactions. Finally, the database now supports network authentication protocol Kerberos, eliminating the need to store passwords on client computers.
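To illustrate what “validate on load” means, here is a toy sketch in plain Python. It is not SHACL syntax and it is not GraphDB’s API; it only mimics the idea of one kind of constraint (a minimum count of values for a property on every node of a given class) being checked automatically whenever a batch of data arrives. The `Shape` and `Store` classes and the `ex:` names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class Shape:
    """Toy constraint: each node of target_class needs >= min_count values for path."""
    target_class: str
    path: str
    min_count: int = 1


def validate(triples: List[Triple], shapes: List[Shape]) -> List[str]:
    """Return one human-readable message per violated constraint."""
    violations = []
    for shape in shapes:
        nodes = {s for s, p, o in triples
                 if p == "rdf:type" and o == shape.target_class}
        for node in sorted(nodes):
            count = sum(1 for s, p, _ in triples
                        if s == node and p == shape.path)
            if count < shape.min_count:
                violations.append(
                    f"{node}: expected at least {shape.min_count} "
                    f"value(s) for {shape.path}, found {count}")
    return violations


class Store:
    """In-memory triple list that refuses batches which break a shape."""

    def __init__(self, shapes: List[Shape]):
        self.shapes = shapes
        self.triples: List[Triple] = []

    def load(self, batch: List[Triple]) -> None:
        candidate = self.triples + batch
        problems = validate(candidate, self.shapes)
        if problems:
            raise ValueError("; ".join(problems))
        self.triples = candidate  # commit only when the data conforms
```

With a shape such as `Shape("ex:Project", "ex:name")`, loading a project without a name raises an error at load time, so nobody has to remember to run a checking query afterward.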

Cynthia Murrell, January 31, 2020

Former Amazonian Suggests the Pre Built Models Are Pipe Dreams

January 30, 2020

I read a PR-infused write up with some interesting, presumably accurate information. The article is from ZDNet.com (an outfit somewhat removed from Mr. Ziff’s executive dining room). Its title? “Reality Engines Offers a Deep Learning Tour de Force to Challenge Amazon et al in Enterprise AI”. Here’s a passage which warranted an Amazon orange highlighter circle:

The goal, Reddy told ZDNet, is a service that “automatically creates production-ready models from data in the wild,” to ease the labor of corporations that don’t have massive teams of data scientists and deep learning programmers. “While other companies talk about offering this service, it is still largely a pipe-dream,” wrote Reddy in an email exchange with ZDNet. “We have made significant strides towards this goal,” she said.

Who will care about this assertion? Since the founder of the company is a former top dog of “AI verticals” at Amazon’s AWS cloud service, Amazon may care. Amazon asserts that SageMaker and related tools make machine learning easier, faster, better (cheaper may depend on one’s point of view). A positive summary of some of Amazon’s machine learning capabilities appears in “Building Fully Custom Machine Learning Models on AWS SageMaker: A Practical Guide.”

Because the sweeping generalization about “pipe dreams” includes most of the machine learning honchos and honchettes, Facebook, Google, IBM, and others are probably going to pay attention. After all, Reality Engines has achieved “significant strides” with 18 people, some advisers, and money from Google’s former adult, Eric Schmidt, who invested $5.25 million.

The write up provides a glimpse of some of the ingredients in the Reality Engines’ secret sauce:

… The two pillars of the offering are “generative adversarial networks,” known as “GANs,” and “network architecture search.” Those two technologies can dramatically reduce the effort needed to build machine learning for enterprise functions, the company contends. GANs, of course, are famous for making fake faces by optimizing a competition between two neural networks based on the encoding and decoding of real images. In this case, Reality Engines has built something called a “DAGAN,” a GAN that can be used for data augmentation, the practice of making synthetic data sets when not enough data is available to train a neural network in a given domain. DAGANs were pioneered by Antreas Antoniou of the Institute for Adaptive and Neural Computation at the University of Edinburgh in 2018. The Reality Engines team has gone one better: They built a DAGAN by using network architecture search, or “NAS,” in which the computer finds the best architecture for the GAN by trying various combinations of “cells,” basic primitives composed of neural network modules.

For those not able to visualize a GAN and DAGAN system, the write up includes an allegedly accurate representation of some of the Reality Engines’ components. The diagram in the write up is for another system, and authored in part by a wizard working at another firm, but let’s assume we are in the ballpark conceptually:

image

It appears that there is a training set. The data are fed to a DenseNet classifier and a validator. Then the DAGAN generator kicks in and processes data piped from the data sources. What’s interesting is that there are two process blocks (maybe Bayesian at their core with the good old Gaussian stuff mixed in) which “discriminate”. DarkCyber thinks this means that the system tries to reduce its margin of error for metatagging and other operations. The “Real Synthetic” block may be an error checking component, but the recipe is incomplete.

The approach is a mash up: Reality Engines’ code with software called “Bananas,” presumably developed by the company Petuum and possibly experts at the University of Toronto.

How accurate is the system? DarkCyber typically ignores vendors’ assertions about accuracy. You can make up your own mind about this statement:

“The NAS-improved DAGAN improves classification accuracy on the target dataset by as much as 20.5% and can transfer between tasks,” they write.

The “reality” of most machine learning systems is that accuracy of 85 percent is attainable under quite specific conditions: Content from a bounded domain, careful construction of training data, calibration, and on-going retraining when what DarkCyber calls Bayesian drift kicks in. If a system is turned on and just used, accuracy degrades over time. At some point, the outputs are sufficiently wide of the mark that a ZDNet journalist may spot problems.

What does the system output? It seems to DarkCyber that the information in the write up focuses on classifiers. If our interpretation is narrowed to that function, content is dumped into buckets. These buckets make it easy to extract content and perform additional analysis. If each step in a work flow works, the final outputs will have a greater likelihood of being “accurate” or “right.” But there are many slips between the cup and the lip as a famous plagiarizer once repeated.

What type of data can the system process? The answer is structured data, presumably cleansed and validated data.

If the Reality Engines’ approach is of interest, the company’s Web site offers a “Request Access” button. Click it and you are probably free to test the system or kick its off road tires.

Will bananas and backpropagation be on your machine learning menu in the future?

Stephen E Arnold, January 30, 2020

Library Software Soutron Version 4.1.4 Now Available

January 17, 2020

Library automation and cataloging firm Soutron introduces its “Latest Software Update—Soutron Version 4.1.4.” The announcement describes the updates and features, complete with screenshots. The introduction reveals:

“This update provides an eagerly awaited new ‘Collections’ feature, refinements to both the Search Portal, updates to the new Admin pages and further language support. Details can be found below. These latest updates are the results of our agile development process and by working closely with, and listening to, our clients’ needs. The results are an industry leading world class library, archive and information management solution.”

Regarding that shiny new Collections feature, part of the Search Portal, we learn:

“This feature empowers users to select records from within Search Results and to assign them to a ‘Collection’. A user who is logged in may create their own Collection, adding and removing items as needed. A Collection can be easily managed, shared and organized in a tree view as shown below. This makes it easy for users, researchers or lawyers to quickly reference items of use that have been found, creating their own ‘Bento Box’ of records and materials, avoiding the need to keep performing searches or looking through saved searches for multiple relevant records.”

That does sound helpful. Other upgrades include enhanced organization for saved searches, improved viewing on mobile devices, easier search-template management, and the addition of a Default Availability status configuration option. See the write-up for more details.

Based in Derby, United Kingdom, Soutron has been creating library management systems for corporations and other organizations since 1989. The company continues to flourish by proudly embracing technological advances like automation and cloud-based systems.

Cynthia Murrell, January 16, 2020

A Taxonomy Vendor: Still Chugging Along

January 15, 2020

Semaphore Version 5 from Smartlogic coming soon.

An indexing software company, now morphed into a semantic AI outfit, Smartlogic promises Version 5 of its enterprise platform, Semaphore, will be available any time now.

The company modestly presents the announcement below the virtual fold in the company newsletter, “The Semaphore—Smartlogic’s Quarterly Newsletter—December 2019.” The General Access release should be out by the end of January. We’re succinctly informed because in indexing succinct is good:

“Semaphore 5 embodies innovative technologies and strategies to deliver a unified user experience, enhanced interoperability, and flexible integration:

*A single platform experience – modules are tightly integrated.

*Intuitive and simplified installation and administration – software can be download and configured with minimal clicks. An updated landing page allows you to quickly navigate modules and monitor status.

*Improved coupling of classification and language services, as well as improved performance.

*Updated the linguistic model and fact extraction capabilities.

*New – Document Semantic Analyzer – a performant content analyzer that provides detailed classification and language services results.

*New branding that aligns modules with capabilities and functionality.

“Semaphore 5 continues to focus around 3 core areas – Model & collaborate; fact extraction, auto-classification & language services; and integrate & visualize – in a modular platform that allows you to add capabilities as your business needs evolve. As you upgrade to Semaphore 5, you will be able to take advantage of the additional components and capabilities incorporated in your licensed modules.”

Semaphore is available on-premise, in the cloud, or a combination. Smartlogic (not to be confused with the custom app company Smartlogic) was founded in 2006 and is based in San Jose, California. The company owns SchemaLogic. Yep, we’re excited too. Maybe NLP, predictive analytics, and quantum computing technology will make a debut in this release. If not in software, perhaps in the marketing collateral?

Cynthia Murrell, January 15, 2020

An Interesting Hypothesis about Google Indexing

January 15, 2020

We noted “Google’s Crawl-Less Index.” The main idea is that something has changed in how Google indexes. We circled in yellow this statement from the article:

[Google] can do this now because they have a popular web browser, so they can retire their old method of discovering links and let the users do their crawling.

The statement needs context.

The speculation is that Google indexes a Web page only when a user visits a page. Google notes the behavior and indexes the page.

What’s happening, DarkCyber concludes, is that Google no longer brute force crawls the public Web. Indexing takes place when a signal (a human navigating to a page) is received. Then the page is indexed.
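A minimal sketch may help make the hypothesis concrete. In the toy index below, nothing is crawled; a page enters the index only when a visit signal for its URL arrives. This illustrates the speculation, not Google’s actual system. The class name, the `fetch` callable, and the example URLs are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


class SignalDrivenIndex:
    """Inverted index that adds a page only after a user visits it."""

    def __init__(self, fetch: Callable[[str], str]):
        self.fetch = fetch  # maps a URL to its page text
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        self.indexed: Set[str] = set()

    def on_visit(self, url: str) -> None:
        """Handle a 'user navigated to url' signal."""
        if url in self.indexed:
            return  # already indexed; this sketch never re-indexes
        for term in self.fetch(url).lower().split():
            self.postings[term].add(url)
        self.indexed.add(url)

    def search(self, term: str) -> List[str]:
        return sorted(self.postings.get(term.lower(), set()))


# A two-page pretend Web; only one page ever receives a visit.
web = {
    "busy.example": "pension board news",
    "quiet.example": "pension board archive",
}
index = SignalDrivenIndex(web.__getitem__)
index.on_visit("busy.example")
print(index.search("pension"))  # the unvisited page is absent
print(index.search("archive"))  # empty: nobody visited quiet.example
```

The unvisited page holds relevant content, yet it cannot be found. A page that is indexed once and never revisited also goes stale in this sketch, which lines up with the selective updating described in the list that follows.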

Is this user-behavior centric indexing a reality?

DarkCyber has noted these characteristics of Google’s indexing in the last year:

  1. Certain sites are in the Google indexes but are either not updated or updated selectively; for example, the Railway Pension Retirement Board, MARAD, and similar sites.
  2. Large sites like the Auto Channel no longer have backfiles indexed and findable unless the user resorts to Google’s advanced search syntax. Then the results display less speedily than more current content, probably due to the Google caches not having infrequently accessed content in a cache close to that user.
  3. Current content for many specialist sites is not available when it is published. This is a characteristic of commercial sites with unusual domains like dot co and for some blogs.

What’s going on? DarkCyber believes that Google is trying to reduce the increasing and very difficult to control costs associated with indexing new content, indexing updated content (the deltas), and indexing the complicated content which Web sites generate in chasing the dream of becoming number one for a Google query.

Search efficiency, as we have documented in our write ups, books, and columns about Google, boils down to:

  1. Maximizing advertising value. That’s one reason why query expansion is used. Results match more ads and, thus, the advertiser’s ads get broader exposure.
  2. Getting away from the old school approach of indexing the billions of Web pages. 90 percent of these Web pages get zero traffic; therefore, index only what’s actually wanted by users. Today’s Google is not focused on library science, relevance, precision, and recall.
  3. Cutting costs. Cost control at the Google is very, very difficult. The crazy moonshots, the free form approach to management, the need for legions of lawyers and contract workers, the fines, the technical debt of a 20 year old company, the salaries, and the extras—each of these has to be controlled. The job is difficult.

Net net: Ever wonder why finding specific information is getting more difficult via Google? Money.

PS: Finding timely, accurate information and obtaining historical content are more difficult, in DarkCyber’s experience, than at any time since we sold our ThePoint service to Lycos in the mid 1990s.

Stephen E Arnold, January 15, 2020

Intellisophic: Protected Content

December 28, 2019

Curious about Intellisophic? If you navigate to www.intellisophic.com, you get this page. If you know that Intellisophic operates from www.intellisophic.com, you get a live Web site that looks like this:

image

No links, and there is no indication who operates this page.

You persevere and locate a link to the “real” Intellisophic. You spot the About page and click it. What renders?

image

Yep, protected information.

Even companies providing specialized services to governments with “interesting” investors and solutions provide a tiny bit of information; for example, check out https://voyagerlabs.co/.

DarkCyber finds it interesting that a company in the information business does not provide any information about itself.

Stephen E Arnold, December 28, 2019

Instagram Learns about Uncontrolled Indexing

December 23, 2019

Everyone is an expert on search. Everyone can assign index terms, often called metatags or hashtags. The fun world of indexing at this time usually means anyone can make up a “tag” and assign it. This is uncontrolled indexing. The popularity of the method is a result of two things. The first is a desire to save money. Skilled indexers want to develop controlled vocabularies and guidelines for the use of those terms. These activities are expensive, and in MBA land who cares? A second reason is that without an editorial policy and editorial controls, MBAs and engineers can say, “Hey, Boomer, we just provide a platform. Not our problem.”

Not surprisingly even some millennials are figuring out that old school indexing has some value, despite the burden of responsibility. Responsible behavior builds a few ethical muscles.

“How Anti-Vaxxers Get around Instagram’s New Hashtag Controls” reveals some of the flaws of uncontrolled indexing and the shallowness of the solutions crafted by some thumb typing content professionals. This passage explains the not too tough method in use by some individuals:

But anti-vaccine Instagram users have been getting around the controls by employing more than 40 cryptic hashtags such as #learntherisk and #justasking.

There you go. Make up a new indexing term and share it with your fellow travelers. Why not use wonky spelling or an odd ball character?

The write up exposes the limitations of rules based term filtering and makes clear that artificial intelligence is not showing up for required office hours.
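The difference between the two approaches fits in a few lines. The sketch below contrasts a rules based blocklist, which passes any tag nobody has banned yet, with a controlled vocabulary, which passes only the terms an indexer has approved. Both term sets are hypothetical placeholders, not Instagram’s lists. The two test tags are the ones quoted in the article.

```python
# Hypothetical lists for illustration only.
BLOCKED_TAGS = {"#bannedtag1", "#bannedtag2"}
CONTROLLED_VOCABULARY = {"#vaccines", "#publichealth", "#flu"}


def blocklist_allows(tag: str) -> bool:
    """Uncontrolled indexing plus a filter: anything not yet banned passes."""
    return tag.lower() not in BLOCKED_TAGS


def controlled_allows(tag: str) -> bool:
    """Controlled indexing: only pre-approved terms pass."""
    return tag.lower() in CONTROLLED_VOCABULARY


for tag in ("#learntherisk", "#justasking"):
    print(tag, blocklist_allows(tag), controlled_allows(tag))
```

A freshly invented tag passes the blocklist by definition, because no rule mentions it yet. The controlled vocabulary rejects it, but someone has to build and maintain that vocabulary, which is the expense described above.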

Should I review the benefits of controlled term indexing? Yes.

Will I? No.

Why? Today no one cares.

Who needs old fashioned methods? No one who wants his or her bonus.

Stephen E Arnold, December 23, 2019
