TemaTres: Open Source Indexing Tool Updated

February 11, 2020

Open source software is the foundation for many proprietary software startups, as well as for the open source developers themselves. Most open source projects tend to lag when it comes to updates and patches, but TemaTres recently shipped a new version, according to the blog post “TemaTres 3.1 Release Is Out! Open Source Web Tool To Manage Controlled Vocabularies.”

TemaTres is an open source vocabulary server designed to manage controlled vocabularies, taxonomies, and thesauri. The recent update includes the following:

  • “Utility for importing vocabularies encoded in MARC-XML format
  • Utility for the mass export of vocabulary in MARC-XML format
  • New reports about global vocabulary structure (ex: https://r020.com.ar/tematres/demo/sobre.php?setLang=en#global_view)
  • Distribution of terms according to depth level
  • Distribution of sum of preferred terms and the sum of alternative terms
  • Distribution of sum of hierarchical relationships and sum of associative relationships
  • Report about terms with relevant degree of centrality in the vocabulary (according to prototypical conditions)
  • Presentation of terms with relevant degree of centrality in each facet
  • New options to config the presentation of notes: define specific types of note as prominent (the others note types will be presented in collapsed div).
  • Button for Copy to clipboard the terms with indexing value (Copy-one-click button)
  • New user login scheme (login)
  • Allows to config and add Google Analytics tracking code (parameter in config.tematres.php file)
  • Improvements in standard exposure of metadata tags
  • Inclusion of the term notation or code in the search box predictive text
  • Compatibility with PHP 7.2”
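
The “relevant degree of centrality” reports in the notes above refer to a standard graph measure computed over the vocabulary’s relationship structure. Here is a tiny sketch of degree centrality on an invented thesaurus; this is our own toy, not TemaTres code:

    # Degree centrality over a toy controlled vocabulary: terms with many
    # hierarchical or associative relationships score highest. Data invented.
    VOCAB_RELATIONS = {
        "animals": {"mammals", "birds", "pets"},
        "mammals": {"animals", "dogs", "cats"},
        "birds":   {"animals"},
        "pets":    {"animals", "dogs", "cats"},
        "dogs":    {"mammals", "pets"},
        "cats":    {"mammals", "pets"},
    }

    n = len(VOCAB_RELATIONS)
    centrality = {term: len(rels) / (n - 1) for term, rels in VOCAB_RELATIONS.items()}

    for term, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
        print(f"{term:8s} {score:.2f}")   # "animals", "mammals", and "pets" dominate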

TemaTres does not update frequently, but the project is monitored. The main ethos of open source is to give back as much as you take, and TemaTres appears to follow this modus operandi. If TemaTres wants to promote its web image, however, the organization should upgrade its Web site, fix the broken links, and provide more information about what the software actually does.

Whitney Grace, February 11, 2020

Ontotext: GraphDB Update Arrives

January 31, 2020

Semantic knowledge firm Ontotext has put out an update to its graph database, The Register announces in “It’s Just Semantics: Bulgarian Software Dev Ontotext Squeezes Out GraphDB 9.1.” Some believe graph databases are The Answer to a persistent issue. The article explains:

“The aim of applying graph database technology to enterprise data is to try to overcome the age-old problem of accessing latent organizational knowledge; something knowledge management software once tried to address. It’s a growing thing: Industry analyst Gartner said in November the application of graph databases will ‘grow at 100 per cent annually over the next few years’. GraphDB is ranked at eighth position on DB-Engines’ list of most popular graph DBMS, where it rubs shoulders with the likes of tech giants such as Microsoft, with its Azure Cosmos DB, and Amazon’s Neptune. ‘GraphDB is very good at text analytics because any natural language is very ambiguous: a project name could be a common English word, for example. But when you understand the context and how entities are connected, you can use these graph models to disambiguate the meaning,’ [GraphDB product manager Vassil] Momtchev said.”

The primary feature of this update is support for the Shapes Constraint Language, or SHACL, which the World Wide Web Consortium recommends for validating data graphs against a set of conditions. This support lets the application validate data against the schema whenever new data is loaded to the database instead of having to manually run queries to check. A second enhancement allows users to track changes in current or past database transactions. Finally, the database now supports network authentication protocol Kerberos, eliminating the need to store passwords on client computers.
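
For readers who want to see what SHACL validation looks like in practice, here is a minimal sketch using the open source rdflib and pyshacl Python libraries rather than GraphDB itself; the shape, data, and property names below are invented for illustration:

    # Minimal SHACL validation sketch using the rdflib and pyshacl libraries.
    # The graph, shape, and property names below are invented for illustration.
    from rdflib import Graph
    from pyshacl import validate

    data_ttl = """
    @prefix ex: <http://example.org/> .
    ex:alice a ex:Person ; ex:name "Alice" .
    ex:bob   a ex:Person .                    # missing ex:name, should fail validation
    """

    shapes_ttl = """
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix ex: <http://example.org/> .
    ex:PersonShape a sh:NodeShape ;
        sh:targetClass ex:Person ;
        sh:property [ sh:path ex:name ; sh:minCount 1 ] .
    """

    data_graph = Graph().parse(data=data_ttl, format="turtle")
    shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

    conforms, report_graph, report_text = validate(data_graph, shacl_graph=shapes_graph)
    print(conforms)      # False: ex:bob violates the minCount constraint
    print(report_text)   # human-readable validation report

The point of GraphDB’s new feature is that this kind of check runs automatically whenever new data is loaded, rather than as a manual query or script like the one above.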

Cynthia Murrell, January 31, 2020

Former Amazonian Suggests the Pre-Built Models Are Pipe Dreams

January 30, 2020

I read a PR-infused write up with some interesting, presumably accurate information. The article is from ZDNet.com (an outfit somewhat removed from Mr. Ziff’s executive dining room). Its title? “Reality Engines Offers a Deep Learning Tour de Force to Challenge Amazon et al in Enterprise AI”. Here’s a passage which warranted an Amazon orange highlighter circle:

The goal, Reddy told ZDNet, is a service that “automatically creates production-ready models from data in the wild,” to ease the labor of corporations that don’t have massive teams of data scientists and deep learning programmers. “While other companies talk about offering this service, it is still largely a pipe-dream,” wrote Reddy in an email exchange with ZDNet. “We have made significant strides towards this goal,” she said.

Who will care about this assertion? Since the founder of the company is a former top dog of “AI verticals” at Amazon’s AWS cloud service, Amazon may care. Amazon asserts that SageMaker and related tools make machine learning easier, faster, better (cheaper may depend on one’s point of view). A positive summary of some of Amazon’s machine learning capabilities appears in “Building Fully Custom Machine Learning Models on AWS SageMaker: A Practical Guide.”

Because the sweeping generalization about “pipe dreams” includes most of the machine learning honchos and honchettes, Facebook, Google, IBM, and others are probably going to pay attention. After all, Reality Engines has achieved “significant strides” with 18 people, some advisers, and money from Google’s former adult, Eric Schmidt, who invested $5.25 million.

The write up provides a glimpse of some of the ingredients in the Reality Engines’ secret sauce:

… The two pillars of the offering are “generative adversarial networks,” known as “GANs,” and “network architecture search.” Those two technologies can dramatically reduce the effort needed to build machine learning for enterprise functions, the company contends. GANs, of course, are famous for making fake faces by optimizing a competition between two neural networks based on the encoding and decoding of real images. In this case, Reality Engines has built something called a “DAGAN,” a GAN that can be used for data augmentation, the practice of making synthetic data sets when not enough data is available to train a neural network in a given domain. DAGANs were pioneered by Antreas Antoniou of the Institute for Adaptive and Neural Computation at the University of Edinburgh in 2018. The Reality Engines team has gone one better: They built a DAGAN by using network architecture search, or “NAS,” in which the computer finds the best architecture for the GAN by trying various combinations of “cells,” basic primitives composed of neural network modules.

For those not able to visualize a GAN and DAGAN system, the write up includes an allegedly accurate representation of some of the Reality Engines’ components. The diagram in the write up is for another system, and authored in part by a wizard working at another firm, but let’s assume we are in the ballpark conceptually:

[Image: pipeline diagram of a GAN/DAGAN system from the write up]

It appears that there is a training set. The data are fed to a DenseNet classifier and a validator. Then the DAGAN generator kicks in and processes data piped from the data sources. What’s interesting is that there are two process blocks (maybe Bayesian at the core with the good old Gaussian stuff mixed in) which “discriminate”. DarkCyber thinks this means that the system tries to reduce its margin of error for metatagging and other operations. The “Real Synthetic” block may be an error checking component, but the recipe is incomplete.
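
To make the generator-versus-discriminator idea concrete, here is a minimal, generic GAN training loop in PyTorch. It is not Reality Engines’ code, not a DAGAN, and nothing here was found by NAS; the toy dimensions, data, and hyperparameters are ours, purely to show how a generator learns to emit synthetic samples a discriminator cannot tell from real ones:

    # Toy GAN sketch: a generator learns to mimic 2-D "real" data so that the
    # synthetic samples can later be mixed into a small training set.
    # Dimensions, data, and hyperparameters are all invented for illustration.
    import torch
    import torch.nn as nn

    real_data = torch.randn(512, 2) * 0.5 + torch.tensor([2.0, -1.0])  # stand-in for scarce real samples

    G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> fake sample
    D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> real/fake logit
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(2000):
        # 1. Train the discriminator to separate real samples from generated ones.
        noise = torch.randn(64, 8)
        fake = G(noise).detach()
        real = real_data[torch.randint(0, len(real_data), (64,))]
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # 2. Train the generator to fool the discriminator.
        noise = torch.randn(64, 8)
        g_loss = bce(D(G(noise)), torch.ones(64, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()

    # Synthetic samples that could augment a classifier's training data:
    augmented = torch.cat([real_data, G(torch.randn(512, 8)).detach()])
    print(augmented.shape)  # torch.Size([1024, 2])

A DAGAN adds structure on top of this basic loop so the generated samples are useful for augmentation, and NAS searches over the architectures of G and D instead of fixing them by hand as we did here.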

The approach is a mash up: Reality Engines’ code with software called “Bananas,” presumably developed by the company Petuum and possibly experts at the University of Toronto.

How accurate is the system? DarkCyber typically ignores vendor’s assertions about accuracy. You can make up your own mind about this statement:

“The NAS-improved DAGAN improves classification accuracy on the target dataset by as much as 20.5% and can transfer between tasks,” they write.

The “reality” of most machine learning systems is that accuracy of 85 percent is attainable under quite specific conditions: Content from a bounded domain, careful construction of training data, calibration, and on-going retraining when what DarkCyber calls Bayesian drift kicks in. If a system is turned on and just used, accuracy degrades over time. At some point, the outputs are sufficiently wide of the mark that a ZDNet journalist may spot problems.
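
A crude way to watch that degradation happen, using assumptions we invented for illustration (synthetic Gaussian data whose distribution drifts after the model is trained):

    # Toy illustration of accuracy decay under distribution drift: train once,
    # then evaluate as the test distribution slides away from the training one.
    # Data, shifts, and model are invented for illustration.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def sample(n, shift):
        x0 = rng.normal(0.0 + shift, 1.0, (n, 2))   # class 0 cluster
        x1 = rng.normal(3.0 + shift, 1.0, (n, 2))   # class 1 cluster
        return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

    X_train, y_train = sample(500, shift=0.0)
    clf = LogisticRegression().fit(X_train, y_train)

    for shift in [0.0, 0.5, 1.0, 1.5, 2.0]:          # the world drifts, the model does not
        X_test, y_test = sample(500, shift=shift)
        print(f"shift={shift:.1f}  accuracy={clf.score(X_test, y_test):.2f}")

Accuracy falls steadily as the shift grows, which is the point: without retraining, yesterday’s model quietly becomes today’s error generator.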

What does the system output? It seems to DarkCyber that the information in the write up focuses on classifiers. If our interpretation is narrowed to that function, content is dumped into buckets. These buckets make it easy to extract content and perform additional analysis. If each step in a work flow works, the final outputs will have a greater likelihood of being “accurate” or “right.” But there are many slips between the cup and the lip, as a famous plagiarizer once repeated.

What type of data can the system process? The answer is structured data, presumably cleansed and validated data.

If the Reality Engines’ approach is of interest, the company’s Web site offers a “Request Access” button. Click it and you are probably free to test the system or kick its off-road tires.

Will bananas and backpropagation be on your machine learning menu in the future?

Stephen E Arnold, January 30, 2020

Library Software Soutron Version 4.1.4 Now Available

January 17, 2020

Library automation and cataloging firm Soutron introduces its “Latest Software Update—Soutron Version 4.1.4.” The announcement describes the updates and features, complete with screenshots. The introduction reveals:

“This update provides an eagerly awaited new ‘Collections’ feature, refinements to both the Search Portal, updates to the new Admin pages and further language support. Details can be found below. These latest updates are the results of our agile development process and by working closely with, and listening to, our clients’ needs. The results are an industry leading world class library, archive and information management solution.”

Regarding that shiny new Collections feature, part of the Search Portal, we learn:

“This feature empowers users to select records from within Search Results and to assign them to a ‘Collection’. A user who is logged in may create their own Collection, adding and removing items as needed. A Collection can be easily managed, shared and organized in a tree view as shown below. This makes it easy for users, researchers or lawyers to quickly reference items of use that have been found, creating their own ‘Bento Box’ of records and materials, avoiding the need to keep performing searches or looking through saved searches for multiple relevant records.”

That does sound helpful. Other upgrades include enhanced organization for saved searches, improved viewing on mobile devices, easier search-template management, and the addition of a Default Availability status configuration option. See the write-up for more details.
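
For readers who think in code, the Collections feature described above amounts to a user-owned tree of saved records. Here is a minimal sketch of the idea, using our own invented structure rather than Soutron’s actual data model:

    # Minimal sketch of a user-owned "Collection" as a tree of saved records.
    # This is our own toy model, not Soutron's actual schema.
    from dataclasses import dataclass, field

    @dataclass
    class Collection:
        name: str
        records: list = field(default_factory=list)     # catalogue record IDs
        children: list = field(default_factory=list)    # nested sub-collections

        def add_record(self, record_id):
            self.records.append(record_id)

        def remove_record(self, record_id):
            self.records.remove(record_id)

        def tree(self, depth=0):
            lines = ["  " * depth + f"{self.name} ({len(self.records)} records)"]
            for child in self.children:
                lines.append(child.tree(depth + 1))
            return "\n".join(lines)

    case = Collection("Smith v. Jones")
    case.add_record("REC-0001")
    case.children.append(Collection("Expert reports", records=["REC-0142", "REC-0197"]))
    print(case.tree())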

Based in Derby, United Kingdom, Soutron has been creating library management systems for corporations and other organizations since 1989. The company continues to flourish by proudly embracing technological advances like automation and cloud-based systems.

Cynthia Murrell, January 16, 2020

A Taxonomy Vendor: Still Chugging Along

January 15, 2020

Semaphore Version 5 from Smartlogic coming soon.

An indexing software company— now morphed into a semantic AI outfit — Smartlogic promises Version 5 of its enterprise platform, Semaphore, will be available any time now.

The company modestly presents the announcement below the virtual fold in the company newsletter, “The Semaphore—Smartlogic’s Quarterly Newsletter—December 2019.” The General Access release should be out by the end of January. We’re succinctly informed because in indexing succinct is good:

“Semaphore 5 embodies innovative technologies and strategies to deliver a unified user experience, enhanced interoperability, and flexible integration:

*A single platform experience – modules are tightly integrated.

*Intuitive and simplified installation and administration – software can be download and configured with minimal clicks. An updated landing page allows you to quickly navigate modules and monitor status.

*Improved coupling of classification and language services, as well as improved performance.

*Updated the linguistic model and fact extraction capabilities.

*New – Document Semantic Analyzer – a performant content analyzer that provides detailed classification and language services results.

*New branding that aligns modules with capabilities and functionality.

“Semaphore 5 continues to focus around 3 core areas – Model & collaborate; fact extraction, auto-classification & language services; and integrate & visualize – in a modular platform that allows you to add capabilities as your business needs evolve. As you upgrade to Semaphore 5, you will be able to take advantage of the additional components and capabilities incorporated in your licensed modules.”

Semaphore is available on-premise, in the cloud, or a combination. Smartlogic (not to be confused with the custom app company Smartlogic) was founded in 2006 and is based in San Jose, California. The company owns SchemaLogic. Yep, we’re excited too. Maybe NLP, predictive analytics, and quantum computing technology will make a debut in this release. If not in software, perhaps in the marketing collateral?

Cynthia Murrell, January 15, 2020

An Interesting Hypothesis about Google Indexing

January 15, 2020

We noted “Google’s Crawl-Less Index.” The main idea is that something has changed in how Google indexes. We circled in yellow this statement from the article:

[Google] can do this now because they have a popular web browser, so they can retire their old method of discovering links and let the users do their crawling.

The statement needs context.

The speculation is that Google indexes a Web page only when a user visits a page. Google notes the behavior and indexes the page.

What’s happening, DarkCyber concludes, is that Google no longer brute force crawls the public Web. Indexing takes place when a signal (a human navigating to a page) is received. Then the page is indexed.
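
If this hypothesis is correct, the pipeline looks less like a brute force crawler and more like an event-driven queue: a visit signal arrives, the page is fetched and indexed, and pages nobody visits never enter the index. A toy sketch of the idea (entirely our speculation, with an invented fetcher and index, not Google code):

    # Toy sketch of "index on visit": a page is fetched and indexed only when
    # a browser visit signal arrives, instead of being crawled exhaustively.
    # Entirely speculative; the queue, fetcher, and index are invented.
    import queue

    visit_events = queue.Queue()     # e.g. URLs reported by a popular browser
    inverted_index = {}              # term -> set of URLs
    already_indexed = set()

    def fetch(url):
        # Stand-in for a real fetch; returns fake page text.
        return f"page text for {url}"

    def index_page(url):
        if url in already_indexed:
            return                   # nothing new signaled for this page
        for term in fetch(url).lower().split():
            inverted_index.setdefault(term, set()).add(url)
        already_indexed.add(url)

    # Simulated browser signals drive indexing; unvisited pages never enter the index.
    for url in ["https://example.com/a", "https://example.com/b"]:
        visit_events.put(url)

    while not visit_events.empty():
        index_page(visit_events.get())

    print(sorted(inverted_index.get("text", set())))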

Is this user-behavior centric indexing a reality?

DarkCyber has noted these characteristics of Google’s indexing in the last year:

  1. Certain sites are in the Google indexes but are either not updated or updated selectively; for example, the Railway Pension Retirement Board, MARAD, and similar sites
  2. Large sites like the Auto Channel no longer have backfiles indexed and findable unless the user resorts to Google’s advanced search syntax. Even then, the results display less speedily than more current content, probably because Google does not keep infrequently accessed content in a cache close to that user
  3. Current content for many specialist sites is not available when it is published. This is characteristic of commercial sites with unusual domains like dot co and of some blogs.

What’s going on? DarkCyber believes that Google is trying to reduce the increasing and very difficult to control costs associated with indexing new content, indexing updated content (the deltas), and indexing the complicated content which Web sites generate in chasing the dream of becoming number one for a Google query.

Search efficiency, as we have documented in our write ups, books, and columns about Google, boils down to:

  1. Maximizing advertising value. That’s one reason why query expansion is used. Results match more ads and, thus, the advertiser’s ads get broader exposure.
  2. Getting away from the old school approach of indexing the billions of Web pages. 90 percent of these Web pages get zero traffic; therefore, index only what’s actually wanted by users. Today’s Google is not focused on library science, relevance, precision, and recall.
  3. Cutting costs. Cost control at the Google is very, very difficult. The crazy moonshots, the free form approach to management, the need for legions of lawyers and contract workers, the fines, the technical debt of a 20 year old company, the salaries, and the extras—each of these has to be controlled. The job is difficult.

Net net: Ever wonder why finding specific information is getting more difficult via Google? Money.

PS: Finding timely, accurate information and obtaining historical content are more difficult, in DarkCyber’s experience, than at any time since we sold our ThePoint service to Lycos in the mid 1990s.

Stephen E Arnold, January 15, 2020

Intellisophic: Protected Content

December 28, 2019

Curious about Intellisophic? If you navigate to www.intellisophic.com, you get this page. If you know that Intellisophic operates from www.intellisophic.com, you get a live Web site that looks like this:

[Image: screenshot of the live Intellisophic Web site]

No links, and there is no indication who operates this page.

You persevere and locate a link to the “real” Intellisophic. You spot the About page and click it. What renders?

[Image: screenshot of the password-protected About page]

Yep, protected information.

Even companies providing specialized services to governments, with “interesting” investors and solutions, provide a tiny bit of information; for example, check out https://voyagerlabs.co/.

DarkCyber finds it interesting that a company in the information business does not provide any information about itself.

Stephen E Arnold, December 28, 2019

Instagram Learns about Uncontrolled Indexing

December 23, 2019

Everyone is an expert on search. Everyone can assign index terms, often called metatags or hashtags. The fun world of indexing at this time usually means anyone can make up a “tag” and assign it. This is uncontrolled indexing. The popularity of the method is a result of two things. The first is a desire to save money: skilled indexers want to develop controlled vocabularies and guidelines for the use of those terms, and these activities are expensive; in MBA land, who cares? The second is that without an editorial policy and editorial controls, MBAs and engineers can say, “Hey, Boomer, we just provide a platform. Not our problem.”
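
The difference between the two approaches is easy to show in code. A controlled vocabulary maps whatever tags users invent onto a fixed set of preferred terms, with an indexer deciding what falls outside the scheme; the toy vocabulary below is our own invention:

    # Toy contrast between uncontrolled tagging and a controlled vocabulary.
    # The vocabulary and its "use for" variants are invented for illustration.
    CONTROLLED_VOCABULARY = {
        # preferred term           accepted variants
        "vaccination":             {"vaccines", "vax", "immunization", "jab"},
        "information retrieval":   {"search", "ir", "findability"},
    }

    def normalize(tag):
        """Map a user-supplied tag to a preferred term, or None if out of scope."""
        tag = tag.lower().lstrip("#")
        for preferred, variants in CONTROLLED_VOCABULARY.items():
            if tag == preferred or tag in variants:
                return preferred
        return None   # a trained indexer (or an editorial rule) has to decide what happens here

    print(normalize("#Vax"))          # -> "vaccination"
    print(normalize("#justasking"))   # -> None: invented tags fall outside the scheme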

Not surprisingly, even some millennials are figuring out that old school indexing has some value, despite the burden of responsibility. Responsible behavior builds a few ethical muscles.

“How Anti-Vaxxers Get around Instagram’s New Hashtag Controls” reveals some of the flaws of uncontrolled indexing and the shallowness of the solutions crafted by some thumb-typing content professionals. This passage explains the not-too-tough method in use by some individuals:

But anti-vaccine Instagram users have been getting around the controls by employing more than 40 cryptic hashtags such as #learntherisk and #justasking.

There you go. Make up a new indexing term and share it with your fellow travelers. Why not use wonky spelling or an oddball character?

The write up exposes the limitations of rules-based term filtering and makes clear that artificial intelligence is not showing up for required office hours.
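
A blunt illustration of why exact-match blocklists leak, and why even a crude fuzzy comparison only catches some of the workarounds (the tags below are invented examples):

    # Exact-match blocking versus a crude fuzzy check. Tags are invented examples.
    from difflib import SequenceMatcher

    BLOCKED = {"#vaccineskill", "#vaccineinjury"}

    def exact_block(tag):
        return tag.lower() in BLOCKED

    def fuzzy_block(tag, threshold=0.8):
        return any(SequenceMatcher(None, tag.lower(), b).ratio() >= threshold for b in BLOCKED)

    for tag in ["#vaccineinjury", "#vacc1neinjury", "#learntherisk", "#justasking"]:
        print(tag, exact_block(tag), fuzzy_block(tag))
    # "#vacc1neinjury" slips past the exact filter but not the fuzzy check;
    # brand-new coinages such as "#learntherisk" slip past both.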

Should I review the benefits of controlled term indexing? Yes.

Will I? No.

Why? Today no one cares.

Who needs old fashioned methods? No one who wants his or her bonus.

Stephen E Arnold, December 23, 2019

From the Desk of Captain Obvious: How Image Recognition Mostly Works

July 8, 2019

Want to be reminded about how super duper image recognition systems work? If so, navigate to the capitalist tool’s “Facebook’s ALT Tags Remind Us That Deep Learning Still Sees Images as Keywords.” The DarkCyber team knows that this headline is designed to capture clicks and certainly does not apply to every image recognition system available. But if the image is linked via metadata to something other than a numeric code, then images are indeed mapped to words. Words, it turns out, remain useful in our video and picture first world.

Nevertheless, the write up offers some interesting comments, which is what the DarkCyber research team expects from the capitalist tool. (One of our DarkCyber team members saw Malcolm Forbes at a Manhattan eatery keeping a close eye on a spectacularly gaudy motorcycle. Alas, that Mr. Forbes is no longer with us, although the motorcycle probably survives somewhere, unlike the “old” Forbes’ editorial policies.)

Here’s the passage:

For all the hype and hyperbole about the AI revolution, today’s best deep learning content understanding algorithms are still remarkably primitive and brittle. In place of humans’ rich semantic understanding of imagery, production image recognition algorithms see images merely through predefined galleries of metadata tags they apply based on brittle and naïve correlative models that are trivially confused.

Yep, and ultimately the hundreds of millions of driver license pictures will be mapped to words; for example, name, address, city, state, zip, along with a helpful pointer to other data about the driver.

The capitalist tool reminds the patient reader:

Today’s deep learning algorithms “see” imagery by running it through a set of predefined models that look for simple surface-level correlative patterns in the arrangement of its pixels and output a list of subject tags much like those human catalogers half a century ago.
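
Mechanically, “seeing images as keywords” boils down to thresholding a classifier’s score vector into a flat list of metadata tags. A bare-bones sketch with invented labels and scores standing in for a real model’s output:

    # Bare-bones "image recognition as keyword assignment": a model's class
    # scores are thresholded into ALT-style tags. Labels and scores are invented.
    import numpy as np

    LABELS = ["person", "dog", "bicycle", "outdoor", "tree", "car"]

    def scores_to_alt_tags(scores, threshold=0.5):
        """Turn per-class confidence scores into a flat list of keyword tags."""
        return [label for label, s in zip(LABELS, scores) if s >= threshold]

    # Pretend these confidences came from a trained image classifier.
    fake_scores = np.array([0.91, 0.07, 0.62, 0.88, 0.31, 0.12])
    print("Image may contain:", ", ".join(scores_to_alt_tags(fake_scores)))
    # -> Image may contain: person, bicycle, outdoor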

Once again, no push back from Harrod’s Creek. However, it is disappointing that new research is not referenced in the article; for example, the companies involved in Darpa Upside.

Stephen E Arnold, July 8, 2019

How Smart Software Goes Off the Rails

June 23, 2019

Navigate to “How Feature Extraction Can Be Improved With Denoising.” The write up seems like a straightforward analytics explanation. Lots of jargon, buzzwords, and hippy-dippy references to length-squared sampling in matrices. The concept is not defined in the article. And if you remember statistics 101, you know that there are five types of sampling: convenience, cluster, random, systematic, and stratified. Each has its strengths and weaknesses. How does one avoid the issues? Use length-squared sampling, obviously: just sample rows with probability proportional to the square of their Euclidean norms. Got it?
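
For the curious, length-squared sampling itself is simple enough to show in a few lines of numpy (toy matrix, our own example, not the article’s code):

    # Length-squared row sampling: pick rows of a matrix with probability
    # proportional to the square of their Euclidean norms. Toy matrix.
    import numpy as np

    rng = np.random.default_rng(42)
    A = rng.normal(size=(1000, 20))

    row_norms_sq = np.sum(A * A, axis=1)         # ||A_i||^2 for each row i
    probs = row_norms_sq / row_norms_sq.sum()    # sampling distribution over rows

    sample_idx = rng.choice(len(A), size=50, replace=True, p=probs)
    S = A[sample_idx]                            # small sketch built from "heavy" rows
    print(S.shape)                               # (50, 20)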

However, the math is not the problem. Math is a method. The glitch is in defining “noise.” Like love, noise can be defined in many ways. The write up points out:

Autoencoders with more hidden layers than inputs run the risk of learning the identity function – where the output simply equals the input – thereby becoming useless. In order to overcome this, Denoising Autoencoders(DAE) was developed. In this technique, the input is randomly induced by noise. This will force the autoencoder to reconstruct the input or denoise. Denoising is recommended as a training criterion for learning to extract useful features that will constitute a better higher level representation.
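
To make the quoted description concrete, here is a tiny denoising autoencoder in PyTorch; the data, layer sizes, and noise level are our own toy choices, not the article’s:

    # Tiny denoising autoencoder: corrupt the input with Gaussian noise and
    # train the network to reconstruct the clean version. Toy data and sizes.
    import torch
    import torch.nn as nn

    X = torch.rand(2000, 30)                     # stand-in for "clean" feature vectors

    model = nn.Sequential(
        nn.Linear(30, 8), nn.ReLU(),             # encoder: 8-dim bottleneck "features"
        nn.Linear(8, 30),                        # decoder
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(200):
        noisy = X + 0.2 * torch.randn_like(X)    # randomly corrupt the input
        recon = model(noisy)
        loss = loss_fn(recon, X)                 # ...but reconstruct the clean input
        opt.zero_grad()
        loss.backward()
        opt.step()

    features = model[:2](X)                      # learned representation for downstream use
    print(features.shape)                        # torch.Size([2000, 8])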

Can you spot the flaw in the approach? Consider what happens if the training set is skewed for some reason. The system will learn based on inputs smoothed by statistical sanding. When the system encounters real world data, the system will, by golly, interpret the “real” inputs in terms of the flawed denoising method. As one wit observed, “So s?c^2 p gives us a better estimation than the zero matrix.” Yep.

To sum up, the system just generates “drifting” outputs. The fix? Retraining. This is expensive and time-consuming. Not good when the method is applied to real-time flows of data.

In a more colloquial turn of phrase, the denoiser may not be denoising correctly.

As more complex numerical recipes are embedded in “smart” systems, there will be some interesting consequences. Does the phrase “chain of failure” come to mind? What about “good enough”?

Stephen E Arnold, June 23, 2019
