ChemNet: Pre-Training and Rules Can Work, but Time and Cost Can Be a Roadblock

February 27, 2019

I read “New AI Approach Bridges the Slim Data Gap That Can Stymie Deep Learning Approaches.” The phrase “slim data” caught my attention. Pairing the phrase with “deep learning” seemed to point the way to the future.

The method described in the document reminded me that creating rules for “smart software” works on narrow domains with constraints on terminology. No emojis allowed. The method of “pre-training” has been around since the early days of smart software. Autonomy in the mid-1990s relied upon training its “black box.”

Creating a training set which represents the content to be processed or indexed can be a time-consuming, expensive business. And because content “drifts,” re-training is required. For some types of content, the training process must be repeated and verified.
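One way to decide when re-training is due is to measure how far incoming content has drifted from the corpus the model was trained on. The sketch below is illustrative only: it assumes a simple bag-of-words view, and the corpora, function names, and 0.1 threshold are invented for this example, not drawn from the article.

```python
# Hypothetical sketch: flag when incoming content has "drifted" from the
# corpus a model was trained on, signaling that re-training may be due.
from collections import Counter
from math import log2

def term_distribution(docs):
    """Relative frequency of each token across a list of documents."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two term distributions (0 = identical)."""
    vocab = set(p) | set(q)
    m = {t: (p.get(t, 0) + q.get(t, 0)) / 2 for t in vocab}
    def kl(a, b):
        return sum(a[t] * log2(a[t] / b[t]) for t in a if a[t] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

training_docs = ["benzene ring toxicity assay", "molecule toxicity screen assay"]
incoming_docs = ["quarterly revenue forecast", "advertising spend forecast"]

drift = js_divergence(term_distribution(training_docs),
                      term_distribution(incoming_docs))
needs_retraining = drift > 0.1  # threshold would be tuned per collection
```

The drift score alone does not make re-training cheap, of course; it only tells you when the expensive step can no longer be postponed.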

So the cost of rule creation, tuning, and tweaking is one thing. The expense of training, training set tuning, and retraining is another. Add them up, and the objective of keeping costs down and accuracy up becomes a bit of a challenge.

The article focuses on the benefits of the new system as it crunches and munches its way through chemical data. The idea is to let software identify molecules for their toxicity.

Why hasn’t this type of smart software been used to index outputs at scale?

My hunch is that the time, cost, and accuracy of the indexing itself are a challenge. Eighty percent accuracy may be okay for some applications, like identifying patients at risk of diabetes. Identifying substances that will not kill one outright is another matter.

In short, the slim data gap and deep learning remain largely unsolved even for a constrained content domain.

Stephen E Arnold, February 27, 2019

Google Book Search: Broken Unfixable under Current Incentives

February 19, 2019

I read “How Badly is Google Books Search Broken, and Why?” The main point is that search results do not include the expected results. The culprit, as I understand the write up, is that searching for rare strings of characters within a time slice behaves in an unusual manner. I noted this statement:

So possibly Google has one year it displays for books online as a best guess, and another it uses internally to represent the year they have legal certainty a book is released. So maybe those volumes of the congressional record have had their access rolled back as Google realized that 1900 might actually mean 1997; and maybe Google doesn’t feel confident in library metadata for most of its other books, and doesn’t want searchers using date filters to find improperly released books. Oddly, this pattern seems to work differently on other searches. Trying to find another rare-ish term in Google Ngrams, I settled on “rarely used word”; the Ngrams database lists 192 uses before 2002. Of those, 22 show up in the Google index. A 90% disappearance rate is bad, but still a far cry from 99.95%.

There are many reasons one can identify for the apparent misbehavior of the Google search system for books. The author identifies the main reason but does not focus on it.

From my point of view and based on the research we have done for my various Google monographs, Google’s search systems operate in silos. But each shares some common characteristics even though the engineers, often reluctantly assigned to what are dead-end or career-stalling projects, make changes.

One of the common flaws has to do with the indexing process itself. None of the Google silos does a very good job with time-related information. Google itself has a fix, but implementing the fix for most of its services is a cost-increasing step.
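The behavior the quoted author describes is easy to reproduce in miniature: give each record a public best-guess year and an internal rights-cleared year, and have the date filter quietly consult the internal one. The field names and records below are hypothetical, not Google’s actual schema.

```python
# Illustrative sketch of the suspected failure mode: the searcher filters
# on the year shown to the public, but the index filters on an internal
# rights-cleared year, so "1900" volumes silently vanish from results.
books = [
    {"title": "Congressional Record v.33", "display_year": 1900, "cleared_year": 1997},
    {"title": "Farm Almanac",              "display_year": 1901, "cleared_year": 1901},
]

def search_by_year(records, start, end, field):
    """Return titles whose chosen date field falls inside [start, end]."""
    return [r["title"] for r in records if start <= r[field] <= end]

as_user_expects = search_by_year(books, 1890, 1910, "display_year")  # both titles
as_silo_behaves = search_by_year(books, 1890, 1910, "cleared_year")  # almanac only
```

Two dates per record is a defensible design; the flaw is filtering on one while displaying the other.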

The result is that Google focuses on innovations which can drive revenue; that is, online advertising for the mobile user of Google services.

But Google’s time blindness is unlikely to be remediated any time soon. For a better implementation of sophisticated time operations, take a look at the technology for time-based retrieval, time slicing, and time analytics from the Google- and In-Q-Tel-funded company Recorded Future.

In my lectures about Google’s time blindness DNA, I compare and contrast what Recorded Future can do versus what Google silos are doing.

Net net: Performing sophisticated analyses of the Google indexes requires the type of tools available from Recorded Future.

Stephen E Arnold, February 19, 2019

DarkCyber for February 12, Now Available

February 12, 2019

DarkCyber for February 12, 2019, is now available at www.arnoldit.com/wordpress and on Vimeo at https://www.vimeo.com/316376994. The program is a production of Stephen E Arnold. It is the only weekly video news show focusing on the Dark Web and lesser known Internet services.

This week’s story line up includes: Italy’s facial recognition system under fire; Marriott trains 500,000 employees to spot human traffickers; a new Dark Web search system from Portugal; and the most popular digital currencies on the hidden Web.

The first story explores the political criticism of Italy’s facial recognition system for law enforcement. The database of reference images contains about one third of Italy’s population. The system integrates with other biometric systems, including the fingerprint recognition modules operating at several of Italy’s busiest airports. Despite the criticism, government authorities have no practical alternative for matching images to persons of interest. DarkCyber believes image recognition is going to become more important and more widely used as its accuracy improves and costs come down.

The second story discusses Marriott Corporation’s two year training program. The hotel chain created information to help employees identify cues and signals of human trafficking. The instructional program also provides those attending with guidelines for taking appropriate action. Marriott has made the materials available to other groups. But bad actors have shifted their mode of operation to include short term rentals from Airbnb type vendors. Stephen E Arnold, producer of DarkCyber and author of “CyberOSINT: Next Generation Information Access,” said: “The anonymity of these types of temporary housing makes it easier for human traffickers to avoid detection. Prepaid credit cards, burner phones, and moving victims from property to property create an additional set of challenges for law enforcement.”

The third story provides information about a new hidden Web indexing service. The vendor is Dogdaedis. The system uses “artificial intelligence” to automatically index the hidden services its crawler identifies. A number of companies are indexing and analyzing the Dark Web. Meanwhile, the number of Dark Web and hidden Web sites is decreasing due to increased pressure from law enforcement. Bad actors have adapted, shifting from traditional single point hidden Web sites to encrypted chat services.

The final story extracts from a Recorded Future report the most popular digital currencies on the Dark Web. Bitcoin is losing ground to Litecoin and Monero.

A new blog Dark Cyber Annex is now available at www.arnoldit.com/wordpress. Cyber crime, Dark Web, and company profiles are now appearing on a daily basis.

Kenny Toth, February 12, 2019

Why Not Filter for These Hashtags?

January 4, 2019

This is one of our DarkCyber news items.

The DarkCyber research team noted some of the child porn hashtags. A list was released by the Child Rescue Coalition. If you are a sworn law enforcement officer, complete the form and request the full list.

From the news reports about the CRC’s list, we compiled some of the words and phrases bad actors use to locate child porn.

Here’s a partial list of hashtags:

#babykini
#babypeeing
#bathtime
#bathtimefun
#bikinikids
#bikinikidslovers
#bikinikidsmodeling
#cantkeepclothesonhim
#cleankids
#diaperfree
#kidbikini
#kidsshower
#kidsshowertime
#kidsswimwear
#lillootoddlerpotty
#lovesbeingnude
#modelingchild
#nakedbaby
#nakedchild
#nakedchildren
#nakedkid
#nakedkiddos
#nakedkids
#nakedkidsagain
#nakedkidsarehappykids
#nakedkidsclub
#nakedkidseverywhere
#nakedtoddler
#nakedtoddleralert
#nappyfree
#nudechild
#nudekids
#peeingkid
#potty
#pottydance
#pottydanceparty
#pottydancetime
#pottylife
#pottyparty
#pottytime
#pottytrain
#pottytrained
#pottytrainedbefore2
#pottytraining
#pottytraining101
#pottytraining4kids
#pottytrainingbootcamp
#pottytrainingboys
#pottytrainingdays
#pottytrainingdiaries
#pottytrainingfail
#pottytrainingfun
#pottytrainingguide
#pottytrainingsuccess
#pottytrainingsucks
#pottytrainingtime
#pottytrainingtwins
#pottytrainingwoes
#sexychildren
#sexykids
#sinkchild
#skinnybabybooty
#startpottytraining
#toddlerbathfun
#toddlerbathing
#toddlerbaths
#toddlerbikinis
#toddlerbikinisrule
#toilettrain
#toilettraining
#toddlerbikini

There are, of course, numerous variations, which should be relatively easy to map to an old school filter—unless of course the service doesn’t want to lose ad revenue or invest developer time in this offensive exploitation of indexing terms.

Several thoughts:

  1. Services should filter for these terms, note the individual or handle using the hashtag, and compile that data. The information may be useful to law enforcement.
  2. Why are social media services not blocking these hashtags? DarkCyber knows that bad actors will cook up new terms or use emoji combinations, but mapping the tags to identities may be useful in some investigations.
  3. These terms strike DarkCyber as obvious. What other hashtags are in use?

These “in plain sight” index terms are available to anyone with an Internet connection. No Dark Web, Tails, or Whonix required.

Stephen E Arnold, January 4, 2019

Data Protection: Many Vendors, Many Incidents

January 4, 2019

This is one of our DarkCyber news items.

Search engines are getting smarter and better, especially since they began to incorporate social media in their indexing. It is harder than ever to protect personal information, and then there is the rising Dark Web fear. While there are services that say they can monitor the Dark Web and the vanilla Web to protect your information, there are also things you can do to protect yourself. TechRadar shares some tips in the article, “AI And The Next Generation Of Search Engines.”

The article focuses on Xiliab’s Frank Cha, who works at South Korea’s largest AI developer. Xiliab recently developed the DataXchain data trading platform, which is described as the search engine of the future. Cha explained why DataXchain is the search engine of the future:

“Dataxchain engine is the next generation of data trading engine which enables not only data processing such as automatic data collection, classification, tagging, and curation but also enables data transactions. These transactions are directly applied to human development without human intervention by pre-processing data matching and deep learning engine. These trials can be accessed to the implicit knowledge through the intervention of people that the traditional search engine already had.”

Cha stresses the biggest challenge with DataXchain is creating connections with clients. He said, “When this connection becomes a chain, we will be able to exchange value for private data of each individual or organization and it will bring innovation to sophisticated AI in dataXchain…”

The platform is also being used for national defense, which can be translated into protecting an individual’s data without changing the algorithm.

It is a basic interview without much meat about how to protect your data. Defensive forces can use the same algorithm as regular people, but that does not sound reassuring. How about speaking in layman’s terms?

With so many competitors, why are there so many successful breaches?

Whitney Grace, January 4, 2019

Amazon: Wheel Re-Invention

December 19, 2018

Some languages have bound phrases; that is, two words which go together. Examples include “White House,” a presidential dwelling, and “ticket counter,” a place to talk with an uninterested airline professional. How does a smart software system recognize a bound phrase and then connect it to the speaker’s or writer’s intended meaning? There is a difference between “I toured the White House” and “Turn left at the white house.”

Traditionally, vendors of text analysis, indexing, and NLP systems used jargon to explain a collection of methods pressed into action to make sense of language quirks. The guts of most systems are word lists, training material selected to make clear that in certain contexts some words go together and have a specific meaning; for example, “terminal” doesn’t make much sense until one knows whether the speaker or writer is referencing a place to board a train (railroad terminal), the likely fate of a sundowner (terminal as in dead), or a computer interface device (dumb terminal).
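One classic member of that collection of methods is pointwise mutual information: pairs of words that co-occur far more often than their individual frequencies predict are likely bound phrases. A toy sketch, with an invented corpus; this is a generic technique, not Amazon's or any specific vendor's pipeline.

```python
# A sketch of one classic technique behind bound-phrase detection:
# pointwise mutual information (PMI). Word pairs that co-occur far more
# often than chance predicts score high. The toy corpus is invented.
from collections import Counter
from math import log2

corpus = ("i toured the white house . turn left at the white house . "
          "the white house issued a statement . the car stopped . "
          "the statement was long . the road was long . "
          "a white car . a red house .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n = len(corpus)

def pmi(w1, w2):
    """log2( P(w1 w2) / (P(w1) * P(w2)) ) over the toy corpus."""
    p_pair = bigrams[(w1, w2)] / (n - 1)
    return log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

pmi_bound = pmi("white", "house")  # genuine bound phrase, high association
pmi_loose = pmi("the", "white")    # frequent neighbors, weaker association
```

PMI alone does not disambiguate the presidential dwelling from the house that happens to be white; for that, systems lean on capitalization, context windows, and exactly the kind of training material described above.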

How does Amazon accomplish this magic? Amazon embraces jargon, of course, and then explains its bound phrase magic in “How Alexa Knows “Peanut Butter” Is One Shopping-List Item, Not Two.”

Amazon’s spin is spoken language understanding. The write up explains how the system operates. But the methods are ones that others have used. Amazon, to be sure, has tweaked the procedures. That’s standard operating procedure in the index game.

What’s interesting is that no reference is made to the contextual information which Amazon has to assist its smart software with disambiguation.

But Amazon is now talking, presumably to further the message that the company is a bold, brave innovator.

No argument from Harrod’s Creek. That’s a bound phrase, by the way, with capital letters and sometimes an apostrophe, sometimes not.

Stephen E Arnold, December 19, 2018

Belgium and the GOOG

December 14, 2018

In 2012, Google cut a deal with Belgian publishers over content scraping. The idea was that indexing public Web sites was not something that put a smile on some Belgian publishers’ faces. Google’s approach to settlements has warranted its own news item on a Harvard Web site.

Belgium — for the most part — is a quiet, western European country that accepts a couple of languages as standard and cranks out pretty good waffles. Apparently, Belgium does not like it when Google exposes its top-secret military bases. Yes, I think exposing a nation’s military secrets is a good reason to be mad and sue. Fast Company reports that “Belgium Is Suing Google Over Satellite Images Of Military Bases,” and Google is not listening.

Belgium has asked Google to blur out images of its military bases from its satellite photographs. The country has also requested Google blur out its nuclear power plants and air bases as well. Belgium is not happy:

“The defense ministry made the request citing national security. It’s not clear why Google has not honored that request, as it is a standard one for governments to make of the search giant, which in the past had no problem obscuring images of sensitive military sites. We’ve reached out to Google for comment and will update this post if we hear back.”

A Belgian Google representative explained that the company has worked closely with the Belgian Department of Defense to change Google’s maps and is disappointed it is now being sued. Google plans to continue working with the Belgian government to resolve the issue.

It is reassuring that Google methods do not discriminate based on the size of a country.

Whitney Grace, December 14, 2018

Semantic SEO: A Frothy Romp

November 6, 2018

Someone spent a long, long time assembling the information included in “Using Topic Modelling to Win Big with NLP and Semantic Search.” [The original spells “modelling” with two Ls. I have changed the spelling in my write up.] I am not exactly sure what “semantic search” means. I have a glimmer of understanding about natural language processing. Whether it works as one assumes is, of course, another thing entirely. The idea of “topic modeling” is new. “Models” I get. Topic modeling, not so much. My thought is that the phrase means indexing and categorization. But?

The slide deck covers quite a bit of ground in the Microsoft / LinkedIn / Slideshare document. The lingo in the document includes a bountiful gathering of buzzwords. Also, there’s an equation, although I am not certain it clarifies anything. Could it be that its inclusion is intended to add some mathiness to the confection?

Here you go. Channel your inner Leibniz with an intuitive view:

[image: the equation from the slide deck]

Remarkable what SEO experts can assemble.
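If “topic modeling” really does boil down to indexing and categorization, the core mechanic can be sketched in a few lines: score each document against topic word lists and file it under the best match. The topics, vocabularies, and documents below are invented; a real topic model such as LDA infers the word lists from the corpus instead of taking them as given.

```python
# Toy categorizer: file each document under the topic whose word list it
# overlaps most. Topics and documents are invented for illustration; a
# genuine topic model (e.g., LDA) learns the word lists from the corpus.
TOPICS = {
    "seo": {"ranking", "keywords", "backlinks", "serp"},
    "nlp": {"parsing", "tokens", "embedding", "corpus"},
}

def categorize(doc):
    words = set(doc.lower().split())
    scores = {topic: len(words & vocab) for topic, vocab in TOPICS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"

label = categorize("improve your ranking with better keywords")  # "seo"
```

Stripped of the buzzwords, that is the whole confection: counting word overlaps, with the equation deciding which counts win.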

Stephen E Arnold, November 6, 2018


Search Becomes More Like Human Memory

November 1, 2018

The one advantage humans have always had over computers is our dynamic ability to index thoughts, memories, and opinions. However, those days of superiority might be over if one company has its way. We learned more about how our search is becoming a lot more like a brain in a recent Silicon Canals story, “History Search: Here’s How This Rotterdam Startup Helps to Retrieve Information Online.”

Here’s the short and sweet on History Search:

“With this startup, you can keep your information organized on the web. It is done by indexing the text on the web pages automatically while browsing and making it searchable with any keyword that you remember. Basically, History Search saves your time every time you need to open a web page.”

Basically, you can recall a snippet of something you read ages ago and have it brought back before your eyes. Sounds a lot like a human brain. If that weren’t weird enough, some AI companies are making even more human-like strides. For example, experts think smells will soon be indexed and searchable. Smell any hype lately?
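What History Search describes, indexing page text at browse time and making it searchable by any remembered keyword, is presumably a classic inverted index underneath. A minimal sketch with invented names; the startup’s actual implementation is not public.

```python
# Sketch of the presumed mechanism: build an inverted index from page
# text as pages are visited, then look pages up by any remembered
# keyword. URLs and structure are invented for illustration.
from collections import defaultdict

index = defaultdict(set)  # term -> set of URLs

def index_page(url, text):
    """Tokenize a visited page and record each term against its URL."""
    for token in text.lower().split():
        index[token.strip(".,!?")].add(url)

def search(keyword):
    """Return every indexed page containing the remembered keyword."""
    return sorted(index.get(keyword.lower().strip(".,!?"), set()))

index_page("example.com/waffles", "Belgian waffles are pretty good.")
index_page("example.com/search", "Search engines index web pages.")
found = search("waffles")  # ["example.com/waffles"]
```

A half-remembered word is enough to pull the page back, which is the brain-like trick; the hard parts the sketch skips are ranking, phrase queries, and pruning the index as history grows.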

Patrick Roland, November 1, 2018

DarkCyber for October 16, 2018 Is Now Available

October 16, 2018

DarkCyber for October 16, 2018, is now available at www.arnoldit.com/wordpress and on Vimeo at https://vimeo.com/295026034.

Stephen E Arnold’s DarkCyber is a weekly video news and analysis program about the Dark Web and lesser known Internet services. This week’s program covers three stories related to the Dark Web and specialized Internet services.

The first story explores what appears to be a public relations spat between two Dark Web indexing vendors. Terbium Labs offers its Matchlight service to government and commercial companies. Digital Shadows sells its SearchLight service to the same markets. Terbium Labs issued a new report. The document asserts that data collection about the Dark Web and related services has to be more stringent and consistent. Digital Shadows’ response was a report claiming that for $150 bad actors would hack the email account of any employee. The data used to back the claim were general, and they lacked the specificity that Terbium Labs desires. DarkCyber’s view is that Terbium Labs is advocating a “high road”; that is, more diligent data collection and more comprehensive indexing. Digital Shadows, on the other hand, seems to embrace the IBM approach to marketing by emphasizing uncertainty and doubt.

The second story reports that PureTech Systems has announced its fully autonomous drone platform. When a sensor is activated, the PureTech drone can launch itself, navigate to the specific location identified by the sensor, and begin collecting information in real time. The data are then fed in real time into the PureTech analytics subsystem. Tasks which once required specialists and intelligence analysts can now be shifted to the PureTech platform.

The final story for October 16, 2018, covers the failure of a California film professional to arrange a Dark Web murder. After police received a tip, the person of interest was arrested. His missteps included using his California driver’s license to purchase the Bitcoin used to pay the Dark Web hit man. The interest in murder for hire seems to be high; however, most of those visiting these sites do not realize that they are scams. The California man paid $5 down on the hit, but his payoff was a stay in jail, not the termination of his stepmother.

DarkCyber appears each Tuesday on the blog Beyond Search and on Vimeo. A four part series about Amazon’s policeware capabilities begins on October 30, 2018. Watch for these programs at www.arnoldit.com/wordpress.

Kenny Toth, October 16, 2018
