Why Do So Few Search Vendors Index the Web?

July 5, 2018

How many companies are indexing the Surface Web, the Dark Web, and the other bits and pieces which comprise the accessible Internet?

The answer is, “Not many most people can name.”

Another question, “Why don’t more companies just index the Internet?”

The answer is, “Money, resources, time, expertise, and generating revenue.”

The write up from 2012, “How to Crawl a Quarter Billion Webpages in 40 Hours,” surfaced again after an absence of six years. The article remains valid even though the principal change in the intervening 72 months is the increased concentration of Google’s index. Microsoft, a company which insists that its Bing system provides an alternative to Google, has not significantly dented Google’s market magnetism. Many of the systems marketed as Web indexes, like Duckduckgo.com and Startpage.com, are metasearch engines; that is, the users’ queries are passed to other services and may be supplemented with some original crawling. A bit of fiddling ensures that the results lists seem to be different. But there is a sameness to the result sets, particularly on popular queries. Yandex, the Russian Web search system, does a good job of handling certain sets of domains, but its overall coverage is not that different from what one can find in Google or Google’s country-centric indexes.
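To make the metasearch point concrete, here is a toy sketch. The backends are stubs, not DuckDuckGo’s or Startpage’s actual upstream services; the point is only how fanning a query out and interleaving the ranked lists produces results that merely seem different:

```python
# Toy metasearch: fan one query out to several backends (stubbed here)
# and interleave the ranked lists, deduplicating across sources.
from itertools import zip_longest

def backend_a(q): return [f"a.com/{q}/1", f"a.com/{q}/2"]   # stub upstream engine
def backend_b(q): return [f"b.com/{q}/1", f"b.com/{q}/2"]   # stub upstream engine

def metasearch(query: str) -> list[str]:
    merged, seen = [], set()
    # Round-robin across backends so no single source dominates the top.
    for pair in zip_longest(backend_a(query), backend_b(query)):
        for url in pair:
            if url and url not in seen:   # drop duplicates across sources
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("python"))
```

The “fiddling” in practice is re-ranking, deduplication, and a dash of original crawl data layered on top of this merge.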

What’s interesting about “How to Crawl” from 2012 is its use of the Amazon system. This is important because the plumbing required to index the Internet can be large, complicated, and expensive.

Does Amazon still operate its A9 Web index? We have heard both yes and no. With a significant number of queries seeking product information, it makes sense to consider Amazon a potential competitor to Bing, Google, and Yandex.

After rereading the “How to Crawl” paper, one thing jumps out. The notion that a quarter of a billion pages is a non-trivial chunk of the Internet is interesting but a bit misleading. There may be upwards of 30 billion indexable Web pages. A large number of these content objects exist in mobile forms; thus, deduplication becomes an interesting issue. That’s why the Google maintains multiple indexes.
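For readers who wonder what deduplication looks like in practice, here is a minimal sketch. It uses an exact content fingerprint; production systems rely on shingling or SimHash for near-duplicates, and nothing here reflects Google’s actual pipeline:

```python
# Minimal dedup tactic for mobile vs. desktop copies of a page:
# strip markup and whitespace, then fingerprint the remaining text.
# This exact-hash version only catches content-identical pages.
import hashlib
import re

def fingerprint(html: str) -> str:
    text = re.sub(r"<[^>]+>", " ", html)               # drop tags
    text = re.sub(r"\s+", " ", text).strip().lower()   # normalize spacing/case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

desktop = "<html><body><h1>Hello</h1><p>Same   story.</p></body></html>"
mobile = "<html><body>\n<h1>Hello</h1> <p>Same story.</p>\n</body></html>"

assert fingerprint(desktop) == fingerprint(mobile)  # collapsed as duplicates
```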

The big question becomes, “Is there another company able to compete with Google?”

After revisiting “How to Crawl” following a six year lapse, the answer may be,

“Very, very few companies. And some of the outfits indexing the Surface and Hidden Internet may not make their activities public.”

Monocultures are okay, but they can be vulnerable to threats the monoculture cannot resist. Is Google like today’s banana? What happens if a blight attacks? One can shift to durian, I suppose.

Stephen E Arnold, July 5, 2018

Mobile Search: The Google Focus

May 28, 2018

SEO is the ultimate moving target. Just as you get a hunch about what algorithms are looking for in order to boost rankings, the algorithms and parameters change, and you practically have to start from scratch. It’s a convoluted world to navigate, and we were pleased to find a really competent explanation of the latest landscape change, the Mobile First Index, in a recent Search Engine Watch story, “Google’s Mobile First Index: Six Actions to Minimize Risk and Maximize Ranking Opportunities.”

According to the story:

“In any period of uncertainty there are opportunities to take advantage of and risks to manage – and in competitive SEO niches, taking every chance to get ahead is important… “Whatever your starting point – the mobile-first index is the new normal in SEO, and now is the time to get to grips with the challenge – and potential.”

Your starting point, it seems, should involve voice search. Another compelling article makes the point that voice search is the ideal pairing for Mobile First Indexing. Watch for this massive shift to happen rapidly. Critics have been eyeing the future of voice search for a while, and now the pieces are finally in place.

Old-fashioned keyword search seems less and less relevant.

Patrick Roland, May 28, 2018

Free Keyword Research Tools

May 15, 2018

Short honk: Search Engine Watch published a write up intended for SEO experts. The article contained some useful links to free keyword research tools. Even if you are not buying online ads or fiddling with your indexing, the services are interesting to know about. The links appear in the Search Engine Watch write up.

Stephen E Arnold, May 15, 2018

Mondeca: Another Semantic Search Option

April 9, 2018

Mondeca, based in France, has long been focused on indexing and taxonomy. Now they offer a search platform named, simply enough, Semantic Search. Here’s their description:

“Semantic search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. Augment your SolR or ElasticSearch capabilities; understand the intent, contextualize search results; search using business terms instead of keywords.”

A few details from the product page caught my eye. Let’s begin with the Search functionality; the page succinctly describes:

“Navigational search – quickly locate specific content or resource. Informational search – learn more about a specific subject. Compound term processing, concept search, fuzzy search, simple but smart search, controlled terms, full text or metadata, relevancy scoring. Takes care of language, spelling, accents, case. Boolean expressions, auto complete, suggestions. Disambiguated queries, suggests alternatives to the original query. Relevance feedback: modify the original query with additional terms. Contextualize by user profile, location, search activity and more.”

The software includes a GUI for visualizing the semantic data, and features word-processing tools like auto complete and a thesaurus. Results are annotated, with key terms highlighted, and filters provide significant refinement, complete with suggestions. Results can also be clustered by either statistics or semantic tags. A personalized dashboard and several options for sharing and publishing round out my list. See the product page for more details.
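Mondeca publishes no code on the product page, so the following is purely illustrative: a hypothetical fuzzy, boolean, relevance-scored query with spelling suggestions, expressed against a made-up “documents” index via the official Elasticsearch Python client (v8-style calls). The index name, field names, and local node URL are all assumptions:

```python
# A minimal sketch, not Mondeca's actual API: fuzzy + boolean querying
# with "did you mean" suggestions over a hypothetical "documents" index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local node

response = es.search(
    index="documents",                       # hypothetical index name
    query={
        "bool": {
            "must": [
                # Fuzziness absorbs spelling, accent, and case slips.
                {"match": {"body": {"query": "semantic serch",
                                    "fuzziness": "AUTO"}}}
            ],
            "should": [
                # Boost documents tagged with a controlled business term.
                {"term": {"concepts": "information-governance"}}
            ],
        }
    },
    suggest={
        # Alternatives to the original query, as the product page describes.
        "spelling": {"text": "semantic serch", "term": {"field": "body"}}
    },
    size=10,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```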

Established in 1999 and headquartered in Paris, Mondeca delivers pragmatic semantic solutions to clients in Europe and North America and is proud to have developed its own successful semantic methodology. Perhaps the next time our beloved leader, Stephen E Arnold, visits Paris, the company will make time to speak with him. Previous attempts to set up a meeting came to naught. Ah, France.

Cynthia Murrell, April 9, 2018

Attivio and MC+A Combine Forces

April 7, 2018

Over the years, Attivio positioned itself as more than search. That type of shift has characterized many vendors anchored in search and retrieval. We noted that Attivio has “partnered” with MC+A, a search-centric company. MC+A has also forged a relationship with Coveo, another search and retrieval vendor with a history of repositioning.

We learned from “Attivio and MC+A Announce Partnership to Deliver Next-Generation Cognitive Search Solutions” at Markets Insider that:

“MC+A will resell Attivio’s platform, seamlessly integrate their enterprise-grade connectors into it, and provide SI services in the US market. ‘Partnering with MC+A extends our ability to address organizations’ needs for making all information available to employees and customers at the moment they need it,’ said Stephen Baker, CEO at Attivio. ‘This is particularly critical for companies looking to upgrade legacy search applications onto a modern, machine-learning based search and insight platform.’ …

The story added:

“By combining self-learning technologies, such as natural language processing, machine learning, and information indexing, the Attivio platform is helping Fortune 500 enterprises leverage customer insight, surface upsell opportunities, and improve compliance productivity. MC+A has over 15 years of experience innovating with search and delivering customized search-based applications solutions to enterprises. MC+A has also developed a connector bridge solution that allows customers to leverage existing infrastructure to simplify the transition to the Attivio platform.”

Attivio was founded in 2007, and is headquartered in Newton, Massachusetts. The company’s client roster includes prominent organizations like UBS, Cisco, Citi, and DARPA. Attivio in its early days was similar in some ways to the Fast Search & Transfer technology once cleverly dubbed ESP. No, not extra sensory perception. ESP was the enterprise search platform.

Based in Chicago and founded in 2004, MC+A specializes in implementations of cognitive search and insight engine technology. A couple of years ago, MC+A was involved with Yippy, the former Vivisimo metasearch system. When IBM bought Vivisimo, the metasearch technology morphed into a Big Data component of Watson.

If this walk down memory lane suggests that vendors of proprietary systems have been working to find purchase on revenue mountain, there may be a reason. The big money, based on information available to Beyond Search, comes from integrating open source solutions like Lucene into comprehensive analytic systems.

In a nutshell, the rise of Lucene and Elastic has created opportunities for companies which can deliver more comprehensive solutions than search and retrieval anchored in old-school technology.

More than repositioning, jargon, and partnerships may be needed in today’s marketplace, where “answers,” not laundry lists, are in demand. For mini profiles of vendors which are redefining information access and answering questions, follow the news stories in our new video news program DarkCyber. There’s a new program each week. Plus, you can get a sense of the new directions in information access by reading my 2015 book (still timely and very relevant) CyberOSINT: Next Generation Information Access.

Stephen E Arnold, April 7, 2018

Build an Alternative Google: How To Wanted

April 6, 2018

Hacker News presented an interesting question, “How would you build an internet scale web crawler?” We have been talking with companies which have developed Internet search systems that are not available for free Web search. Those conversations have produced some fascinating information. Some of the data will be included in my upcoming lecture for a government agency and then in my two presentations at the June 2018 Telestrategies ISS Conference in Prague.
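For context, the core of the Hacker News question reduces to a frontier loop. Here is a toy, single-threaded sketch; an internet scale system distributes this frontier across thousands of machines and adds robots.txt handling, politeness delays, and heavy deduplication:

```python
# Toy crawl loop: a frontier queue, a "seen" set for dedup, and crude
# link extraction. Nothing here is internet scale; it shows the skeleton.
import re
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests

LINK_RE = re.compile(r'href="([^"#]+)"')  # crude anchor extraction

def crawl(seed: str, max_pages: int = 50) -> set[str]:
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # never enqueue a URL twice
    fetched = set()
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue           # skip unreachable hosts
        fetched.add(url)
        for href in LINK_RE.findall(resp.text):
            absolute, _ = urldefrag(urljoin(url, href))  # resolve + defragment
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return fetched

if __name__ == "__main__":
    pages = crawl("https://example.com")
    print(f"Fetched {len(pages)} pages")
```

The hard part is everything this sketch omits: sharding the frontier, re-crawl scheduling, spam detection, and storage for billions of documents.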

What was interesting about this question was that few people responded. That is notable because my team’s research for my new presentations on deanonymizing encrypted chat and deanonymizing digital currency transactions pivots on comprehensive Internet indexing. In fact, more companies are indexing Internet content than at any time in the last 10 years.

The second issue the post triggered was a realization that only a handful of people jumped on the topic. With more indexing activity than ever, why aren’t more people helping out JustinGarrison, who posed the question? That’s a question worth thinking about.

Third, one of the responses to the Hacker News question was a pointer to the YaCy.net open source project. We once included this technology in our Internet Research for Law Enforcement training program. My recollection of the system is fuzzy, so I will get one of my team to take a look.

The final thought the Hacker News story triggered was, “Have people just accepted Bing, Google, Qwant, and a handful of metasearch systems as too dominant to challenge?” My view is that an opportunity exists to create a public facing Internet search and retrieval system. The reason? Outstanding alternatives to Bing, Google, and Qwant are available for those who qualify as customers and who are willing to pay the license fees.

My hunch is that just as enterprise search has coalesced around the open source Lucene/Solr technologies, free Web search has become “game over” because the ad supported model has won.

The problem, of course, is that a person looking for information usually does not realize that free Web search results are neither comprehensive, timely, nor objective.

I hope individuals like JustinGarrison get the information needed to seize an opportunity in Internet search.

Stephen E Arnold, April 6, 2018

Artificial Intelligence: Tiny Ears May Listen Well

March 29, 2018

The allegations that Facebook-type companies can “listen” to one’s telephone conversations or regular conversations may be “fake” news. But the idea is worth considering.

Artificial intelligence’s ability to process written data is unparalleled. However, the technology has lagged severely when it comes to spoken words. Soon, that may be a thing of the past, if this recent article is to be believed. We learned more from the Smart Data Collective piece, “Natural Language Processing: An Essential Element of Artificial Intelligence.”

According to the story:

“Natural Language Processing (NLP) is an important part of artificial intelligence which is being researched upon to aid enterprises and businesses in the quick, speedy and fast retrieval of both structured and unstructured organizational data when needed. In simple terms, natural language processing (NLP), is the skill of a machine to understand and process human language within the context in which it is spoken.”
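The quoted definition is abstract, so here is a minimal sketch of the entity and phrase extraction it describes. The article names no toolkit; spaCy and its small English model are my assumption for illustration:

```python
# A minimal NLP sketch: tag entities and noun phrases in a query,
# the raw material for "understanding context."
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline
doc = nlp("Which vitamins am I missing if I skip breakfast in London?")

print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. ('London', 'GPE')
print([chunk.text for chunk in doc.noun_chunks])     # candidate topics
```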

This technology is really taking off in the food industry. According to sources, shoppers in London are the first to use language processing apps to help them determine what vitamins their bodies may be lacking. It may sound like a stretch, but this is the sweet spot where AI really soars. The technology seems to take off in industries that previously seemed to need no help. Watch for language processing to begin bleeding into everyday life elsewhere, too. If one is carrying a mobile phone, is it listening, recording, converting speech to text, and indexing that content for psychographic analysis?

Patrick Roland, March 29, 2018

De-Archiving: Where Is the Money to Deliver Digital Beef?

February 25, 2018

I read “De-Archiving: What Is It and Who’s Doing It?” I don’t want to dig into the logical weeds of the essay. Let’s look at one passage I highlighted.

As the cost of hot storage continues to drop, economics work in favor of taking more and more of their stored material and putting it online. Millions of physical documents, films, recordings, photographs, and historical data are being converted to online digital assets every year. Soon, anything that was worth saving will also be worth putting online. Tomorrow’s warehouse will be a data center filled with spinning disks that safely store any valuable data – even if it has to be converted to a digital format first. “De-archiving” will be a new vocab word for enterprises and individuals everywhere – and everyone will be doing it in the near future.

My hunch is that the thought leader who wrote the phrase “anything that was worth saving will be worth putting online” has not checked out the holdings of the Library of Congress. The American Memory project, on which I worked, represents a minuscule percentage of the non-text information the LoC holds. Toss in text, boxes of manuscripts, and artifacts (3D imaging and indexing). The amount of money required to convert and index the content might stretch the US budget, which seems to wobble along on continuing resolutions.
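A crude, fully hypothetical calculation illustrates the scale problem. Every figure below is an assumption for arithmetic’s sake, not Library of Congress data:

```python
# Back-of-envelope arithmetic for the de-archiving scale problem.
# All figures are assumptions for illustration only.
items = 170_000_000          # assumed collection size
digitize_cost = 10.00        # assumed average $ per item to scan + index
storage_gb_per_item = 0.05   # assumed 50 MB average per digitized item
storage_cost_gb_year = 0.25  # assumed $ per GB-year of "hot" storage

one_time = items * digitize_cost
yearly = items * storage_gb_per_item * storage_cost_gb_year
print(f"Conversion: ${one_time / 1e9:.1f}B once; "
      f"storage: ${yearly / 1e6:.1f}M per year")
```

Even with these kind numbers, conversion alone lands in the billions; film restoration and 3D artifact imaging would push the real figure far higher.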

Big ideas are great. Reality may not be as great. Movies which can disintegrate during conversion? Yeah, right. Easy. Economical.

Stephen E Arnold, February 25, 2018

How SEO Has Shaped the Web

January 19, 2018

With the benefit of hindsight, big-name thinker Anil Dash has concluded that SEO has contributed to the ineffectiveness of Web search. He examines how we got here in his article, “Underscores, Optimization & Arms Races” at Medium. Starting with the year 2000, Dash traces the development of Internet content management systems (CMSs), of which he was a part. (It is a good brief summary for anyone who wasn’t following along at the time.) WordPress is an example of a CMS.

As Google’s influence grew, online publishers became aware of an opportunity: they could game the search algorithm to move their sites to the top of “relevant” results by playing around with keywords and other content details. The question of whether websites should bow to Google’s whims seemed to go unasked, as site after site fell into this pattern, later to be known as Search Engine Optimization. For Dash, the matter was symbolized by the question of whether hyphens or underscores should represent spaces in web addresses. Now, of course, one can use either without upsetting Google’s algorithm, but that was not the case at first. When Google’s Matt Cutts stated a preference for the hyphen in 2005, most publishers fell in line, including Dash, eventually and very reluctantly; for him, the choice represented nothing less than the very nature of the Internet.
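For anyone who missed the era, the underscore-versus-hyphen question was literally a one-line function in a CMS. A toy sketch (not Dash’s actual code) shows the two outputs publishers argued over:

```python
# The slug choice Dash describes: one title, two separators.
import re

def slugify(title: str, separator: str = "-") -> str:
    # Replace every run of non-alphanumeric characters with the separator.
    slug = re.sub(r"[^a-z0-9]+", separator, title.lower())
    return slug.strip(separator)

title = "Underscores, Optimization & Arms Races"
print(slugify(title, "_"))  # underscores_optimization_arms_races
print(slugify(title, "-"))  # underscores-optimization-arms-races
```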

He writes:

You see, the theory of how we felt Google should work, and what the company had often claimed, was that it looked at the web and used signals like the links or the formatting of webpages to indicate the quality and relevance of content. Put simply, your search ranking with Google was supposed to be based on Google indexing the web as it is. But what if, due to the market pressure of the increasing value of ranking in Google’s search results, websites were incentivized to change their content to appeal to Google’s algorithm? Or, more accurately, to appeal to the values of the people who coded Google’s algorithm?

What Dash did not notice at the time, he muses, was the unsettling development of an entire SEO community centered on appeasing these algorithms. He concludes:

By the time we realized that we’d gotten suckered into a never-ending two-front battle against both the algorithms of the major tech companies and the destructive movements that wanted to exploit them, it was too late. We’d already set the precedent that independent publishers and tech creators would just keep chasing whatever algorithm Google (and later Facebook and Twitter) fed to us. Now, the challenge is to reform these systems so that we can hold the big platforms accountable for the impacts of their algorithms. We’ve got to encourage today’s newer creative communities in media and tech and culture to not constrain what they’re doing to conform to the dictates of an opaque, unknowable algorithm.

Is that doable, or have we gone too far toward appeasing the Internet behemoths to turn back?

Cynthia Murrell, January 19, 2018

Is the End of Google Web Search Coming?

December 20, 2017

I read “Google to Use Mobile Version of a Site to Determine Mobile Rankings.” The info, if on the money, makes clear that the Google cares about mobile search, not desktop-anchored Web search. No surprise. The article reported:

[The write up quoted a Googler as stating:] “Mobile-first indexing means that we’ll use the mobile version of the content for indexing and ranking, to better help our – primarily mobile – users find what they’re looking for.” These changes probably won’t affect end users too much, but it does highlight how Google’s efforts are starting to focus more on mobile.

I think the word for this modest step is “deprecate.” Flash forward a year or so, and what have we got? Less “deep” Google indexing of non-mobile Web sites. Fewer PowerPoints indexed. Fewer PDFs indexed. In short, the lack of rigor in indexing the Railroad Retirement Board comes to boat anchor Web sites.

Web indexing is expensive and likely to be facing “friction” from the net neutrality change. This means mobile is money for the GOOG.

Just a thought from Harrod’s Creek.

Stephen E Arnold, December 20, 2017
