Natural Language Generation: Sort of Made Clear

February 28, 2019

I don’t want to spend too much time on NLG (natural language generation). This is a free Web log. Providing the acronym should be enough of a hint.

If you are interested in the subject and can deal with wonky acronyms, you may want to read “Beyond Local Pattern Matching: Recent Advances in Machine Reading.”

Search sucks, so bright young minds want to tell you what you need to know. What if the system is only 75 to 80 percent accurate? The path is a long one, but the direction information retrieval is heading seems clear.

Stephen E Arnold, February 28, 2019

Written by Stephen E. Arnold · Filed Under AI, Indexing, News | Comments Off on Natural Language Generation: Sort of Made Clear

ChemNet: Pre Training and Rules Can Work but Time and Cost Can Be a Roadblock

February 27, 2019

I read “New AI Approach Bridges the Slim Data Gap That Can Stymie Deep Learning Approaches.” The phrase “slim data” caught my attention. Pairing the phrase with “deep learning” seemed to point the way to the future.

The method described in the document reminded me that creating rules for “smart software” works on narrow domains with constraints on terminology. No emojis allowed. The method of “pre training” has been around since the early days of smart software. Autonomy in the mid 1990s relied upon training its “black box.”

Creating a training set which represents the content to be processed or indexed can be a time consuming, expensive business. Plus, because content “drifts,” retraining is required. For some types of content, the training process must be repeated and verified.

So the cost of the rule creation, tuning and tweaking is one thing. The expense of training, training set tuning, and retraining is another. Add them up, and the objective of keeping costs down and accuracy up becomes a bit of a challenge.

The article focuses on the benefits of the new system as it crunches and munches its way through chemical data. The idea is to let software identify molecules for their toxicity.

Why hasn’t this type of smart software been used to index outputs at scale?

My hunch is that the time, cost, and accuracy of the indexing itself are a challenge. Eighty percent accuracy may be okay for some applications like identifying patients with a risk of diabetes. Identifying substances that will not kill one outright is another matter.

In short, the slim data gap and deep learning remain largely unsolved even for a constrained content domain.

Stephen E Arnold, February 27, 2019

Written by Stephen E. Arnold · Filed Under AI, Indexing, News | 1 Comment

Google Book Search: Broken Unfixable under Current Incentives

February 19, 2019

I read “How Badly is Google Books Search Broken, and Why?” The main point is that search results do not include the expected results. The culprit, as I understand the write up, is that looking for rare strings of characters within a time slice behaves in an unusual manner. I noted this statement:

So possibly Google has one year it displays for books online as a best guess, and another it uses internally to represent the year they have legal certainty a book is released. So maybe those volumes of the congressional record have had their access rolled back as Google realized that 1900 might actually mean 1997; and maybe Google doesn’t feel confident in library metadata for most of its other books, and doesn’t want searchers using date filters to find improperly released books. Oddly, this pattern seems to work differently on other searches. Trying to find another rare-ish term in Google Ngrams, I settled on “rarely used word”; the Ngrams database lists 192 uses before 2002. Of those, 22 show up in the Google index. A 90% disappearance rate is bad, but still a far cry from 99.95%.
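The percentages in the quoted passage can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two figures the quote supplies (192 uses in Ngrams, 22 in the Google index); the variable names are mine, not the original author's.

```python
# Figures taken from the quoted passage above.
ngrams_uses = 192   # uses of the phrase before 2002, per Google Ngrams
index_hits = 22     # of those, the number that show up in the Google index

found_rate = index_hits / ngrams_uses
disappearance_rate = 1 - found_rate

print(f"found: {found_rate:.1%}, missing: {disappearance_rate:.1%}")
```

On these numbers the missing share works out to roughly 88.5 percent, so the quoted “90%” is a rounded figure.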

There are many reasons one can identify for the apparent misbehavior of the Google search system for books. The author identifies the main reason but does not focus on it.

From my point of view and based on the research we have done for my various Google monographs, Google’s search systems operate in silos. But each shares some common characteristics even though the engineers, often reluctantly assigned to what are dead end or career stalling projects, make changes.

One of the common flaws has to do with the indexing process itself. None of the Google silos does a very good job with time related information. Google itself has a fix, but implementing the fix for most of its services is a cost increasing step.

The result is that Google focuses on innovations which can drive revenue; that is, online advertising for the mobile user of Google services.

But Google’s time blindness is unlikely to be remediated any time soon. For a better implementation of sophisticated time operations, take a look at the technology for time based retrieval, time slicing, and time analytics from the Google and In-Q-Tel funded company Recorded Future.

In my lectures about Google’s time blindness DNA, I compare and contrast what Recorded Future can do versus what Google silos are doing.

Net net: Performing sophisticated analyses of the Google indexes requires the type of tools available from Recorded Future.

Stephen E Arnold, February 19, 2019

Written by Stephen E. Arnold · Filed Under Google, Indexing, News | Comments Off on Google Book Search: Broken Unfixable under Current Incentives

Amazon: Wheel Re-Invention

December 19, 2018

Some languages have bound phrases; that is, two words which go together. Examples include “White House”, a presidential dwelling, and “ticket counter”, a place to talk with an uninterested airline professional. How does a smart software system recognize a bound phrase and then connect it to the speaker’s or writer’s intended meaning? There is a difference between “I toured the White House” and “Turn left at the white house.”

Traditionally, vendors of text analysis, indexing, and NLP systems used jargon to explain a collection of methods pressed into action to make sense of language quirks. The guts of most systems are word lists, training material selected to make clear that in certain contexts some words go together and have a specific meaning; for example, “terminal” doesn’t make much sense until one gets whether the speaker or writer is referencing a place to board a train (railroad terminal), the likely fate of a sundowner (terminal as in dead), or a computer interface device (dumb terminal).
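The word list approach can be sketched in a few lines. What follows is a minimal illustration, not Amazon’s method or any vendor’s: the phrase list, the function names, and the capitalization cue are all assumptions made for the example. It merges adjacent words that appear in a hand-built list and then uses letter case as a crude hint about meaning.

```python
# Hypothetical phrase list; a production system would rely on far larger
# lists plus training material, as the paragraph above notes.
BOUND_PHRASES = {
    ("white", "house"),
    ("ticket", "counter"),
    ("peanut", "butter"),
    ("railroad", "terminal"),
}
MAX_LEN = max(len(p) for p in BOUND_PHRASES)
PUNCTUATION = ".,;:!?\"'"

def tokenize(text):
    """Split on whitespace and strip surrounding punctuation; keep case."""
    words = (w.strip(PUNCTUATION) for w in text.split())
    return [w for w in words if w]

def tag_bound_phrases(text):
    """Greedy longest match: merge adjacent tokens found in the phrase list."""
    tokens = tokenize(text)
    out = []
    i = 0
    while i < len(tokens):
        for n in range(MAX_LEN, 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if len(window) == n and window in BOUND_PHRASES:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token.
            out.append(tokens[i])
            i += 1
    return out

def is_proper_noun(phrase):
    """Crude capitalization cue, an assumption for this sketch only."""
    return all(word[:1].isupper() for word in phrase.split())

for sentence in ("I toured the White House", "Turn left at the white house."):
    phrase = tag_bound_phrases(sentence)[-1]
    print(phrase, is_proper_noun(phrase))
```

The capitalization cue separates the two example sentences in writing, but it is useless for speech, where there are no capital letters. That is where contextual information has to do the work.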

How does Amazon accomplish this magic? Amazon embraces jargon, of course, and then explains its bound phrase magic in “How Alexa Knows “Peanut Butter” Is One Shopping-List Item, Not Two.”

Amazon’s spin is spoken language understanding. The write up explains how the system operates. But the methods are ones that others have used. Amazon, to be sure, has tweaked the procedures. That’s standard operating procedure in the index game.

What’s interesting is that no reference is made to the contextual information which Amazon has to assist its smart software with disambiguation.

But Amazon is now talking, presumably to further the message that the company is a bold, brave innovator.

No argument from Harrod’s Creek. That’s a bound phrase, by the way, with capital letters and sometimes an apostrophe or not.

Stephen E Arnold, December 19, 2018

Written by Stephen E. Arnold · Filed Under Alexa, Amazon, Indexing, News | Comments Off on Amazon: Wheel Re-Invention

Facial Recognition and Image Recognition: Nervous Yet?

November 18, 2018

I read “A New Arms Race: How the U.S. Military Is Spending Millions to Fight Fake Images.” The write up contained an interesting observation from an academic wizard:

“The nightmare situation is a video of Trump saying I’ve launched nuclear weapons against North Korea and before anybody figures out that it’s fake, we’re off to the races with a global nuclear meltdown.” — Hany Farid, a computer science professor at Dartmouth College

Nothing like a shocking statement to generate fear.

But there is a more interesting image recognition observation. “Facebook Patent Uses Your Family Photos For Targeted Advertising” reports that the social media sparkler has an invention that will

attempt to identify the people within your photo to try and guess how many people are in your family, and what your relationships are with them. So for example if it detects that you are a parent in a household with young children, then it might display ads that are more suited for such family units. [US20180332140]

While considering the implications of pinpointing family members and linking the deduced and explicit data, consider that one’s fingerprint can be duplicated. The dupe allows a touch ID to be spoofed. You can get the details in “AI Used To Create Synthetic Fingerprints, Fools Biometric Scanners.”

For a law enforcement and intelligence angle on image recognition, watch for DarkCyber on November 27, 2018. The video will be available on the Beyond Search blog splash page at this link.

Stephen E Arnold, November 18, 2018

Written by Stephen E. Arnold · Filed Under Entity extraction, Facebook, Image search, Indexing, News, Rich media | 3 Comments

Google Struggles with Indexing?

November 14, 2018

You probably know that Google traffic was routed to China. The culprit was something obvious. In this case, Nigeria. Yep, Nigeria. You can read about the mistake that provided some interesting bits and bytes to the Middle Kingdom. Yeah, I know. Nigeria. “A Nigerian Company Is in Trouble with Google for Re-Routing Traffic to Russia, China” provides some allegedly accurate information.

But the major news I noted here in Harrod’s Creek concerned Google News and its indexing. Your experience may be different from mine, but Google indexing can be interesting. I was looking for an outfit identified as Inovatio, which is a university anchored outfit in China. The reference to Inovatio in Google aimed me at a rock band and a design company in Slovenia. Google’s smart search system changed Inovatio to innovation even when I used quote marks. I did locate the Inovatio operation using a Chinese search engine. I was able to track Ampthon.com which listed Inovatio and provided the university affiliation to allow me to get some info about an outfit providing surveillance and intercept services to countries in need of this capability.

Google. Indexing. Yeah.

“Google News Publishers Complaining About Indexing Issues” highlights another issue with the beloved Google. I learned:

In the past few days there has been an uptick in complaints from Google News publishers around Google not indexing their new news content. Gary Illyes from Google did a rare appearance on Twitter to say he passed along the feedback to the Google News team to investigate. You can scan through the Google News Help forums and see a nice number of complaints. Also David Esteve, the SEO at the Spanish newspaper El Confidencial, posted his concerns on Twitter.

The good news is that the write up mentions that this indexing glitch is a known issue.

Net net: Many people with whom I speak believe that Google’s index is comprehensive, timely, and consistent.

Yeah, also smart because Inovatio is really innovation.

Stephen E Arnold, November 14, 2018

Written by Stephen E. Arnold · Filed Under Google, Indexing, News | 3 Comments

Indexing Matters: The Investment Sector Analysis

October 15, 2018

I read reports which explain why large monopolistic or oligopolistic companies alter the behavior of certain ecosystems. I don’t see that many because analysts are preoccupied with more practical matters; namely, their bonuses, appearances on Bloomberg TV or CNBC, and riding their hobby horses.

I read and then reread “Platform Giants and Venture Backed Startups.” The premise struck me as obvious. The whales of online are functioning like giant electromagnets. These companies pull traffic, attention, and money. At the same time, they emit beacons which are tuned to the inner ears of investors.

[Image: cubes of gelatin dessert]

Looks tasty but only semi organized. And from what is this confection fabricated? Answer: Cow hooves. Intellectual Jello, lovingly crafted to delight the eye.

The squeaks of these ultra high frequency waves alert those looking for big paydays to put their money into startups which do not compete head on with the outfits operating like electromagnets.

The “Platform Giant” write up assembles observations from a report which asserts the opposite; that is, big electromagnets do not have an impact on start ups and most investors.

Put that aside.

The core of the write up makes clear that indexing and classification make a difference. The idea is that if one classifies and marshals data, the classification creates a way to look at the data, the world, and in this particular case the way investments flow or do not flow.

What goes in “Internet software” becomes the trigger for the conclusion. Invest to compete against the Google? Not a good idea.

The question becomes, “Who does the indexing, classification, ontology, and related bits of the taxonomy?”

Indexing is important. But more important is the creation of the knowledge structure and the categories which will be used to chop, slice, and organize data for analysis.

Get the knowledge structure wrong and the flawed categorization creates findings that are probably misleading at best and just off base at worst.
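A small sketch shows how the choice of categories shapes the finding. The deals, dollar amounts, and both classification schemes below are invented for illustration; none of them comes from the report discussed above. The same four records are counted under two hypothetical knowledge structures.

```python
from collections import Counter

# Hypothetical deals: (company description, dollars invested in millions).
deals = [
    ("search advertising tool", 10),
    ("online marketplace", 25),
    ("payments app", 15),
    ("social video app", 30),
]

# Two hypothetical knowledge structures applied to the same records.
broad_scheme = {
    "search advertising tool": "Internet software",
    "online marketplace": "Internet software",
    "payments app": "Internet software",
    "social video app": "Internet software",
}
narrow_scheme = {
    "search advertising tool": "Advertising",
    "online marketplace": "ECommerce",
    "payments app": "Financial",
    "social video app": "Social Media",
}

def totals(scheme):
    """Sum investment per category under the given classification scheme."""
    result = Counter()
    for description, amount in deals:
        result[scheme[description]] += amount
    return dict(result)

print(totals(broad_scheme))
print(totals(narrow_scheme))
```

The arithmetic is identical in both runs. Only the labels differ, yet one scheme reports a single large “Internet software” bucket and the other reports four small ones.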

Who takes the time to work out the knowledge structure before training humans and smart software to assign metadata?

The write up suggests that humans (either with agenda or without, with expertise or not, or with a wonky knowledge superstructure or not) do.

Net net: Counting is verifiable. Pegging what to count may be more like organizing cubes of a gelatin dessert.

Stephen E Arnold, October 15, 2018

Written by Stephen E. Arnold · Filed Under Indexing, News | Comments Off on Indexing Matters: The Investment Sector Analysis

The Semantic Web: Technology Roadkill or a Roadside Snack?

September 24, 2018

I spotted a quote to note. Here it is:

The Semantic Web is as dead as last year’s roadkill.

The statement appears in “Whatever Happened to the Semantic Web?” The write up provides a run through of the starts and stops associated with making the Web into a more organized place.

I would point out that the state of the Semantic Web can be glimpsed in the TweetedTimes’ auto generated list of articles called “Semantic Search.” The collection of items focuses on a range of topics, but the thrust seems to be getting traffic for a Web site; for example, “How to Optimize Content for Semantic SEO.”

If you are an adherent of the Semantic Web, check out the included footnotes. I would point out that the Google has a number of Guha patents in its portfolio. I think the Semantic Web may be of interest to some at the online ad search giant.

Guha’s patents plus the work by Alon Halevy may suggest some interesting use cases for the mark up, triplet, smart agent system and methods.

Stephen E Arnold, September 24, 2018

Written by Stephen E. Arnold · Filed Under Google, Indexing, News | 3 Comments

Bing: No More Public URL Submissions

September 19, 2018

Ever wondered why some Web site content is not indexed? Heck, ever talk to a person who cannot find their Web site in a “free” Web index? I know that many people believe that “free” Web search services are comprehensive. Here’s a thought: The Web indexes are not comprehensive. The indexing is selective, disconnected from meaningful date and time stamps, and often limited to following links to a specified depth; for example, three levels down or fewer in many cases.
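To make the depth limit concrete, here is a minimal sketch of a crawler that stops following links past a set depth. The site map, the page paths, and the `crawl` function are hypothetical; this is not how Bing or Google describe their spidering, which, as noted, is not spelled out.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/about/team"],
    "/products": ["/products/a"],
    "/products/a": ["/products/a/specs"],
    "/products/a/specs": ["/products/a/specs/pdf"],
    "/about/team": [],
    "/products/a/specs/pdf": [],
}

def crawl(start, max_depth):
    """Breadth-first walk that does not follow links from pages at max_depth."""
    seen = {start: 0}          # page -> depth at which it was found
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = seen[page]
        if depth == max_depth:
            continue           # indexed, but its outbound links are ignored
        for target in LINKS.get(page, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen

indexed = crawl("/", max_depth=3)
missing = set(LINKS) - set(indexed)
print(sorted(missing))
```

With a limit of three levels, the page four links down from the home page never enters the index, even though it is online and linked.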

I thought about the perception of comprehensiveness when I read “Bing Is Removing Its Public URL Submission Tool.” The tool allowed a savvy SEO professional or an informed first time Web page creator to let Bing know that a site was online and ready for indexing.

No more.

How do “free” Web indexes find new sites? Now that’s a good question, and the answers range from “I don’t know” to “Bing and Google are just able to find these sites.”

A couple of thoughts:

- Editorial or spidering policies are not spelled out by most Web indexing outfits.
- Users assume that if information is available online, that information is accurate.
- “Free” Web indexing services are not set up to deliver results that are necessarily timely (indexed on a daily basis) or comprehensive.

Bing’s allegedly turning off public URL submissions is a small thing. My question: “Who looked at these submissions and made a decision about what to index or exclude from indexing?” Perhaps the submission form operated like a thermostat control in a hotel room?

Stephen E Arnold, September 18, 2018

Written by Stephen E. Arnold · Filed Under Indexing, Microsoft, News | Comments Off on Bing: No More Public URL Submissions

Semantic Struggles and Metadata

August 31, 2018

I have noticed the flood of links and social media posts about semantics from David Amerland. I found many of the observations interesting; a few struck me as a wildly different view of indexing. A recent essay by David Amerland “Snipers Use Metadata Much Like Semantic Search Does” caught the Beyond Search team’s attention.

Learn about “The Sniper Mind” at this link.

According to the story:

“There are two key takeaways here [about metadata and trained killers]: First, such skills are directly transferable in the business domain and even in most life situations. Second, in order to use their brain in this way snipers need training. The mental training and the psychological aids that are developed as a result of it is what I detailed…”

We must admit that it is a fresh metaphor: Comparing killers’ use of indexing with semantic search. In our experience with professional indexing systems and human indexers, the word “sniper” has not to our recollection been used.

Watch your back, your blindside, or ontology. Oh, also metaphors.

Patrick Roland, August 31, 2018

Written by Stephen E. Arnold · Filed Under Indexing, News | 1 Comment


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.
