Learning the Aboutness of a Web Site or Other Online Text Object

February 7, 2017

Quite by accident the Beyond Search goslings came across a company offering a free semantic profile of online text objects. The idea is to plug in a URL like www.arnoldit.com/wordpress. The Leiki system will generate a snapshot of the concepts and topics the content object manifests. We ran the Beyond Search blog through the system. Here’s what we learned:

The system identified that the blog covers Beyond Search. We learned that our coverage of IBM is more intense than our coverage of the Google. But if one combines the Leiki category “Google Search” with the category “Google,” our love of the GOOG is manifest. We ran several other blogs through the Leiki system and learned about some content fixations that were not previously known to us.


We suggest you give the system a whirl.

The developer of the system provides a range of indexing, consulting, and semantic services. More information about the firm is at www.leiki.com.

Stephen E Arnold, February 7, 2017

IQwest IT Steps Up Its Machine Translation Marketing

February 3, 2017

Machine translation means that a computer converts one language into another. The idea is that the translation is accurate; that is, it presents the speaker’s or writer’s message payload without distortion, oddball syntax, and unintended humor. What’s a “nus”? The name of a nuclear consulting company or a social mistake? Machine translation, as an idea, has been around since that French whiz Descartes allegedly cooked up the idea in the 17th century.

I read two almost identical articles, which triggered my content marketing radar. The first write up appeared in KV Empty Pages as “Finding the Needle in the Digital Multilingual Haystack.” The second article appeared in the Medium online publication as “Finding the Needle in the Digital Multilingual Haystack.”


Notice the similarity. Intrigued, I ran a query for IQwest. I noted that IQwest.com is a bum domain name for the translation outfit. I did a bit of poking around and learned that there are companies using IQwest for engineering services, education, and legal technologies. The IQwest.com domain is owned by Qwest Communications in Denver.

The machine translation write up belongs to the IQwestIT.com group. No big deal, of course, but knowing which company’s name overlaps with other companies’ usage is interesting.

Now what’s the message in these two identical essays beyond content marketing? For me, the main point is that a law firm can use software translation to eliminate documents irrelevant to the legal matter at hand. For documents not in the lawyer’s native language, machine translation can churn out a good enough translation. The value of machine translation is that it is a heck of a lot less expensive than a human translator.

Okay, I understand, but I have understood the value of machine translation since I had access to a Systran based system years ago. Furthermore, machine translation systems have been an area of interest in some of the government agencies with which I am familiar for decades.

The write up states:

building a model and process that takes advantage of benefits of various technologies, while minimizing the disadvantages of them would be crucial. In order to enhance any and all of these solution’s capabilities, it is important to understand that machines and machine learning by itself cannot be the only mechanism we build our processes on. This is where human translations come into the picture. If there was some way to utilize the natural ability of human translators to analyze content and build out a foundation for our solutions, would we be able improve on the resulting translations? The answer is a resounding yes!

Another, okay from me. The solution, which I anticipated, is a rah rah for the IQwest machine translation system. What’s notable is that the number of buzzwords used to explain the system caught my attention; for instance:

  • Classification
  • Clustering
  • N grams
  • Summarization

These standard indexing functions are part of the IQwest machine translation system. That system, the write up notes, can be supplemented with humans who ride herd on the outputs and who interact with the system to make sure that entities (people, places, things, events, etc.) are identified and translated. This is a slippery fish because some persons of interest have different names, handles, nicknames, code words, and legends. Informed humans might be able to spot these entities because no system with which I am familiar is able to knit together some well crafted aliases. Remember those $5,000 teddy bears on eBay? What did they represent?
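For readers unfamiliar with the buzzwords, these are garden variety text processing operations. Word-level n-grams, for instance, can be generated in a few lines of Python (a generic sketch of the concept, not IQwest’s code):

```python
def ngrams(text, n=2):
    """Return word-level n-grams, a basic indexing primitive."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Bigrams over a short passage
print(ngrams("machine translation for legal documents"))
# → ['machine translation', 'translation for', 'for legal', 'legal documents']
```

Classification, clustering, and summarization build on counts of exactly these kinds of features, which is why the functions travel together in vendor literature.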

The write up seems to be aimed at attorneys. I suppose that group of professionals may not be aware of the machine translation systems available online and for on premises installation. For the non attorney reader, the write up tills some familiar ground.

I understand the need to whip up sales leads, but the systems available from Google and Microsoft, to name just two, work reasonably well. When those systems are not suitable, one can turn to SDL or Systran, to name two vendors with workable systems.

Net net: My thought is that two identical versions of the same article directed at a legal audience represents a bit of marketing wonkiness. The write up’s shotgun approach to reaching attorneys is interesting. I noticed the duplication of content, and my hunch is that Google’s duplicate detection system did as well.

Perhaps placing the write up in an online publication reaching lawyers would be a helpful use of the information?  What’s clear is that IQwest represents an opportunity for some motivated marketing expert to offer his or her services to the company.

My take is that IQwest offers a business process for reducing costs for litigation related document processing. The translation emphasis is okay, but the idea of making a phone call and getting the job done is what differentiates IQwest from, for example, the GOOG. I remember Rocket Docket. A winner. When I looked at that “package,” the attorneys with whom I spoke did not care about what was under the hood. The hook was speed, reduced cost, and more time to do less dog work.

But the lawyers may need to hurry. “Lawyers Are Being Replaced by Machines That Read.” Dragging one’s feet technologically and demanding high salaries despite a glut of legal eagles may change the game and quickly.

Plus, keep in mind FreeTranslations.org. You can get voice translations as well as text translations. The increasingly frugal Google has trimmed its online translation service. Sigh. The days of pasting lengthy text into a box is gone like a Loon balloon drifting away from Sri Lanka.

There are options, gentle reader.

Stephen E Arnold, February 3, 2017

Google Semantics Sort of Explained by an SEO Expert

February 1, 2017

I know that figuring out how Google’s relevance ranking works is tough. But why not simplify the entire 15 year ball of wax for those without a grasp of Messrs. Brin and Page, their systems and methods, and the wrapper software glued on the core engine. Keep in mind that it is expensive and time consuming to straighten a bent frame when one’s automobile experiences a solid T bone impact. Google’s technology foundation is that frame, and over the years, it has had some blows, but the old girl keeps on delivering advertising revenue.

I read “Semantic Search for Rookies. How Does Google Search Work?” The write up does not provide the obvious answer; to wit:

Well enough for the company to continue to show revenue growth and profits.

The write up takes a different tack toward the winds of relevance. I highlighted this passage:

Google’s semantic algorithm hasn’t developed overnight. It’s a product of continuous work:

  • Knowledge Graph (2012)
  • Hummingbird (2013)
  • RankBrain (2015)
  • Related Questions and Rich Answers (ongoing)

The work began many years before 2012, but that is of no consequence to the SEO whiz explaining how Google search works.

The write up then brings up the idea of semantic and relevance obstacles. I won’t drag in issues such as disambiguation, a user’s search history, and Google’s method of dealing with repetitive queries. I won’t comment on Ramanathan Guha’s inventions nor bring up the work in semantics which began when Jeff Dean revealed how many versions of Britney Spears’ name were in one of Google’s suggested search subsystems.

The way to take advantage of where Google is today boils down to writing an article, a blog post similar to the one you are reading, or any textual information employing user oriented phrasing and algorithm oriented phrasing. The explanation of these two types of phrasing was too sophisticated for me. I urge you, gentle reader, to consult the source document and learn by sipping from the font of knowledge. (I would have used the phrase “Pierian spring,” but that would have forced me to decide whether I was using a bound phrase, a semantic oriented phrase, or an algorithm oriented phrase. That’s too much work for me.)

The write up concludes with these injunctions:

If you wish to create well-optimized content, you shouldn’t focus on text in the traditional sense. Instead, you should focus on words and word formation which Google expects to see. In this day and age, users’ feedback plays a crucial role in determining the importance of content. You will have to cater to both sides. Create content with lots of synonyms and semantically related words incorporated in it. Try to be provocative and readable at the same time.

I don’t want to rain on the SEO poobah’s parade, but there are some issues that this semantic write up does not address; namely, the challenge of rich media. How does one get one’s video indexed in a correct way in YouTube.com, GoogleVideo.com, Vimeo.com, or one of the other video search systems? What about podcasts, still images, Twitter outputs, public Facebook goodies, and social media image sharing sites?

My point is that defining semantics in terms of a particular content type suggests that Google has a limited repertoire of indexing, metatagging, and cross linking methods. Perhaps a quick look at Dr. Guha’s semantic server would shed some light on the topic? Well, maybe not. This is, after all, SEO oriented with semantic and algorithmic phrasing I suppose.

Stephen E Arnold, February 1, 2017

Semantic Search and Old Style Marketing

January 27, 2017

I read “It Used to Be So Easy to Get Google to Love You Now Not So Much.” I find it amusing that marketing methods which are ineffectual are still used in Google’s mobile oriented, buy-ad world. Here’s a great example from a small company trying to become a headliner.

Years ago I worked on a US government project. I developed a system which manipulated certain Web search systems’ indexing. It seems to me that one outfit has tried to emulate the DNA of my method. You can see the example of content marketing which is designed to polish a halo for a company involved in indexing. Yep, I know indexing is not exactly what makes the venture capitalists’ heart pound. But indexing has a long tradition of being

  1. Expensive
  2. Labor intensive if one wants to deliver precision and recall in search results
  3. Intellectually demanding, particularly when smart software goes off the rails so often
  4. Tough to make magnetic.

The write up “Searching with Semantic Technology” summarizes a write up in a “thought leader” publication. There is a parental reminder to remember how important indexing is. There is a concluding statement which explains that natural language processing plays a role in delivering search results. The buzzword “semantic” is repeated.

The only hitch in the git along is that this method of triggering a Web search system with an abstract, keywords, and an allegedly critical comment is old and no longer works very well.

Why? Let me point out:

  1. Queries come from mobile device users. Some topics don’t lend themselves to mobile methods. It follows that methods based in whole or in part on the methods I developed and explained in my articles over the years are a bit like multiple Xerox copies of an original document. Faded and often useless.
  2. The jargon problem plagues those with niche capabilities. I pointed out in my cacaphone write up and compilation of buzzwords that most folks don’t have a clue what words mean. A good example is “semantic,” a term which has been devalued and applied to everything from marketing search engines to metasearch engines and more.
  3. The Web indexing systems have shifted over the years from reliance on a handful of proven indexing methods to wrappers of code which act “smart.” Results lists are essentially unpredictable today. Spoofing with words is a bit like shooting a handgun at the ocean in the hope of killing a fish.

For more information on an old system which doesn’t work very well anymore, navigate to www.augmentext.com. For more examples of marketing material which uses an ineffectual method to add razzle dazzle to a capability which is at best boring and more often of minimal interest, read the blog which serves as the home to this “insight.”

Kenny Toth, January 27, 2017

Textkernel: Narrowing Search to an HR Utility

January 5, 2017

Remember the good old days of search? Autonomy, Convera, Endeca, Fast Search, and others from the go go 2000s identified search as a solution to enterprise information access. Well, those assertions proved to be difficult to substantiate. Marketing is one thing; finding information is another.

How does a vendor of Google style searching with some pre-sell Clearwell Systems-type business process tweaking avoid the problems which other enterprise search vendors have encountered?

The answer is, “Market search as a solution for hiring.” Just as Clearwell Systems and its imitators did in the legal sector, Textkernel, founded in 2001 and sold to CareerBuilder in 2015, is doing résumé indexing and search focused on finding people to hire. Search becomes “recruitment technology,” which is reasonably clever buzzwording.

The company explains its indexing of CVs (curricula vitae) this way:

CV parsing, also called resume parsing or CV extraction, is the process of converting an unstructured (so-called free-form) CV/resume or social media profile into a structured format that can be integrated into any software system and made searchable. CV parsing eliminates manual data entry, allows candidates to apply via any (mobile) device and enables better search results.

The Textkernel Web site provides more details about the company’s use of tried and true enterprise search functions like metadata generation and report generation (called a “candidate profile”).
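Conceptually, CV parsing is entity extraction over free-form text. A toy sketch in Python conveys the idea; this is an illustration only, not Textkernel’s technology, and the field patterns are simplistic assumptions:

```python
import re

def parse_cv(text):
    """Toy CV parser: lift a few structured fields out of free-form text."""
    email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    phone = re.search(r"\+?\d[\d ()-]{7,}\d", text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }

cv = "Jane Roe\njane.roe@example.com\n+1 502 555 0100\nPython developer"
print(parse_cv(cv))
```

A production system layers on skill taxonomies, work history segmentation, and multilingual handling, which is where the real difficulty lives.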

In 2015 the company had about 70 employees. Using the Overflight revenue estimation tool, Beyond Search pegs the 2015 revenue in the $5 million range.

The good news is that the company avoided the catastrophic thrashing which other European enterprise search vendors experienced. The link to the video on the Textkernel page is broken, which does not bode well for Web coding expertise. However, you can bite into some text kernels at this link.

Stephen E Arnold, January 5, 2017

Alleged Google Loophole Lets Fake News Flow

January 1, 2017

I read a write up which, like 99 percent of the information available for free via the Internet, is 100 percent accurate.

The write up’s title tells the tale: “Google Does a Better Job with Fake News Than Facebook, but There’s a Big Loophole It Hasn’t Fixed.” What’s the loophole? The write up reports:

…the “newsy” modules that sit at the top of many Google searches (the “In the news” section on desktop, and the “Top stories” section on mobile) don’t pull content straight from Google News. They pull from all sorts of content available across the web, and can include sites not approved by Google News. This is particularly confusing for users on the desktop version of Google’s site, where the “In the news” section lives. Not only does the “In the news” section literally have the word “news” in its name, but the link at the bottom of the module, which says “More news for…,” takes you to the separate Google News page, which is comprised only of articles that Google’s editorial system has approved.

The word “news” obviously does not mean news. We reported last week about Google’s effort to define “monopoly” for the European Commission investigating allegations of Google’s being frisky with its search results. News simply needs to be understood in the Google contextual lexicon.

The write up helps me out with this statement:

So why isn’t the “In the news” section just the top three Google News results? The short answer is because Google sees Google Search and Google News as separate products.

Logical? From Google’s point of view absolutely crystal clear.

The write up amplifies the matter:

Google does, however, seem to want to wipe fake news from its platform. “From our perspective, there should just be no situation where fake news gets distributed, so we are all for doing better here,” Google CEO Sundar Pichai said recently. After the issue of fake news entered the spotlight after the election, Google announced it would ban fake-news sites from its ad network, choking off their revenue. But even if Google’s goal is to kick fake-news sites out of its search engine, most Google users probably understand that Google search results don’t carry the editorial stamp of approval from Google.

Fake news, therefore, is mostly under control. The Google users just have to bone up on how Google works to make information available.

What about mobile?

Google AMP is not news; AMP content labeled as “news” is part of the AMP technical standard which speeds up mobile page display.

Google, like Facebook, may tweak its approach to news.

Beyond Search would like to point out that wild and crazy news releases from big time PR dissemination outfits can propagate a range of information (some mostly accurate and some pretty crazy). The handling of high value sources allows some questionable content to flow. Oh, there are other ways to inject questionable content into the Web indexing systems.

There is not one loophole. There are others. Who wants to nibble into revenue? Not Beyond Search.

Stephen E Arnold, January 1, 2017

Study of Search: Weird Results Plus Bonus Errors

December 30, 2016

I was able to snag a copy of “Indexing and Search: A Peek into What Real Users Think.” The study appeared in October 2016, and it appears to be the work of IT Central Station, an outfit described as a source of “unbiased reviews from the tech community.” I thought, “Oh, oh, ‘real users.’ A survey.” An IDC type or Gartner type sample which, although suspicious to me, seems to convey some useful information when the moon is huge. Nope. Unbiased? Nope.

Note that the report is free. One can argue that free does not translate to accurate, high value, somewhat useful information. I support this argument.

The report, like many of the “real” reports I have reviewed over the decades is relatively harmless. In terms of today’s content payloads, the study fires blanks. Let’s take a look at some of the results, and you can work through the 16 pages to double check my critique.

First, who are the “top” vendors? This list reveals quite a bit about the basic flaw in the “peek.” The table below presents the list of “top” vendors along with my comment about each vendor. The list mixes companies with open source Lucene/Solr based systems and companies or brands which have retired from the playing field in professional search.

  • Apache: This is not a search system. It is an open source umbrella for projects of which Lucene and Solr are two projects among many.
  • Attivio: Based on Lucene/Solr open source search software; positioned as a business intelligence vendor.
  • Copernic: A desktop search and research system based on proprietary technology from the outfit known as Coveo.
  • Coveo: A vendor of proprietary search technology now chasing Big Data and customer support.
  • Dassault Systèmes: Owns Exalead, which is now downgraded to a utility within Dassault’s PLM software.
  • Data Design, now Ryft.com: Pitches search without indexing via a proprietary “circuit module” method.
  • Data Gravity: Search is a utility in a storage centric system.
  • DieselPoint: Company has been “quiet” for a number of years.
  • Expert System: Publicly traded and revenue challenged vendor of a metadata utility, not a search system.
  • Fabasoft: Mindbreeze is a proprietary replacement for SharePoint search.
  • Google: Discontinued the Google Search Appliance and exited enterprise search.
  • Hewlett Packard Enterprise: Sold its search technology to Micro Focus; legal dispute in progress over alleged fraud.
  • IBM Omnifind: Lucene and proprietary scripts plus acquired technology.
  • IBM StoredIQ: Like DB2 search, a proprietary utility.
  • ISYS Search Software: Now owned by Lexmark and marginalized due to alleged revenue shortfalls.
  • Lookeen: Lucene based desktop and Outlook search.
  • Lucidworks: Solr add ons; floundering to be more than enterprise search.
  • MAANA: Proprietary search optimized for Big Data.
  • Microsoft: Offers multiple search solutions. The most notorious are Bing and the Fast Search & Transfer proprietary solutions.
  • Oracle: Full text search is a utility for Oracle licensees; owns Artificial Linguistics, Triple Hop, Endeca, RightNow, InQuira, and the marginalized Secure Enterprise Search. Oh, don’t forget command line querying via PL/SQL.
  • Polyspot, now CustomerMatrix: Now a customer service vendor.
  • Siderean Software: Went out of business in 2008; a semantic search outfit.
  • Sinequa: Now a Big Data outfit with hopes of becoming the “next big thing” in whatever sells.
  • X1 Search: An eternal start up pitching eDiscovery and desktop search with a wild and crazy interface.

What’s the table tell us about “top” systems? First, the list includes vendors not directly in the search and retrieval business. There is no differentiation among the vendors repackaging and reselling open source Lucene/Solr solutions. The listing is a fruit cake of desktop, database, and unstructured search systems. In short, the word “top” does not do the trick for me. I prefer “a list of eclectic and mostly unknown systems which include a search function.”

The report presents 10 bar charts which tell me absolutely nothing about search and retrieval. The bars appear to be a popularity contest based on visits to the author’s Web site. Only two of the search systems listed in the bar chart have “reviews.” Autonomy IDOL garnered three reviews and Lookeen one review. The other eight vendors’ products were not reviewed. Autonomy and Lookeen could not be more different in purpose, design, and features.

The report then tackles the “top five” search systems in terms of clicks on the author’s Web site. Yep, clicks. That’s a heck of a yardstick because what percentage of clicks were humans and what percentage was bot driven? No answer, of course.

The most popular “solutions” illustrate the weirdness of the sample. The number one solution is DataGravity, which is a data management system with various features and utilities. The next four “top” solutions are:

  • Oracle Endeca – eCommerce and business intelligence and whatever Oracle can use the ageing system for
  • The Google Search Appliance – discontinued with a cloud solution coming down the pike, sort of
  • Lucene – open source, the engine behind Elasticsearch, which is quite remarkably not on the list of vendors
  • Microsoft Fast Search – included in SharePoint to the delight of the integrators who charge to make the dog heel once in a while.

I find it fascinating that DataGravity (1,273) garnered almost 4X the “votes” of Microsoft Fast Search (404). I think there are more than 200 million SharePoint licensees. Many of these outfits have many questions about Fast Search. I would hazard a guess that DataGravity has a tiny fraction of the SharePoint installed base and that its brand identity and company name recognition are a fraction of Microsoft’s. Weird data, or just meaningless.

The bulk of the report consists of comparisons of various search engines. I could not figure out the logic of the comparisons. What, for example, do Lookeen and IBM StoredIQ have in common? Answer: Zero.

The search report strikes me as a bit of silliness. The report may be an anti sales document. But your mileage will differ. If it does, good luck to you.

Stephen E Arnold, December 30, 2016

Smarter Content for Contentier Intelligence

December 28, 2016

I spotted a tweet about making smart content smarter. It seems that if content is smarter, then intelligence becomes contentier. I loved my logic class in 1962.

Here’s the diagram from this tweet. Hey, if the link is wonky, just attend the conference and imbibe the intelligence directly, gentle reader.


The diagram carries the identifier Data Ninja, which echoes Palantir’s use of the word ninja for some of its Hobbits. Data Ninja’s diagram has three parts. I want to focus on the middle part:


What I found interesting is that instead of a single block labeled “content processing,” the content processing function is broken into several parts. These are:

A Data Ninja API

A Data Ninja “knowledgebase,” which I think is an iPhrase-type or TeraText-type of method. If you are not familiar with iPhrase and TeraText, feel free to browse the descriptions at the links.

A third component in the top box is the statement “analyze unstructured text.” This may refer to indexing and such goodies as entity extraction.

The second box performs “text analysis.” Obviously this process is different from the “analyze unstructured text” step; otherwise, why run the same analyses again? The second box appears to perform clustering of content into specific domains. This is important because a “terminal” in transportation may be different from a “terminal” in a cloud hosting facility. Disambiguation is important because the terminal may be part of a diversified transportation company’s computing infrastructure. I assume Data Ninja’s methods handle this parsing of “concepts” without many errors.
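The domain-sensitive sense of a word such as “terminal” can be resolved with even a crude overlap count against per-domain cue words. The sketch below is a toy illustration of the idea, not Data Ninja’s method; the cue lists are invented for the example:

```python
# Invented cue words for two example domains
DOMAIN_CUES = {
    "transportation": {"freight", "cargo", "airport", "bus"},
    "computing": {"server", "cloud", "shell", "login"},
}

def disambiguate(context):
    """Pick the domain whose cue words overlap the context the most."""
    tokens = set(context.lower().split())
    return max(DOMAIN_CUES, key=lambda d: len(DOMAIN_CUES[d] & tokens))

print(disambiguate("open a terminal on the cloud server"))    # → computing
print(disambiguate("the freight terminal near the airport"))  # → transportation
```

Real systems use statistical context models rather than hand-built word lists, but the underlying bet is the same: surrounding vocabulary reveals the domain.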

Once the selection of a domain area has been performed, the system appears to perform four specific types of operations as the Data Ninja practice their katas. These are the smart components:

  • Smart sentiment; that is, is the content object weighted “positive” or “negative”, “happy” or “sad”, or green light or red light, etc.
  • Smart data; that is, I am not sure what this means
  • Smart content; that is, maybe a misclassification because the end result should be smart content, but the diagram shows smart content as a subcomponent within the collection of procedures/assertions in the middle part of the diagram
  • Smart learning; that is, the Data Ninja system is infused with artificial intelligence, smart software, or machine learning (perhaps the three buzzwords are combined in practice, not just in diagram labeling?)
  • The end result is an iPhrase-type representation of data. (Note that this approach infuses TeraText, MarkLogic, and other systems which transform unstructured data into metadata tagged structured information.)
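The “smart sentiment” weighting in the list above is, at its simplest, a lexicon count. The following is a deliberately crude sketch, not the Data Ninja implementation; the word lists are invented for illustration:

```python
# Invented sentiment lexicons for the example
POSITIVE = {"good", "great", "happy", "win"}
NEGATIVE = {"bad", "poor", "sad", "fail"}

def sentiment(text):
    """Crude lexicon score: positive word count minus negative word count."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("a great and happy win"))  # → positive
```

Commercial systems dress this up with machine-learned weights and negation handling, but the green light/red light output is the same shape.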

The diagram then shows a range of services “plugging” into the box performing the functions referenced in my description of the middle box.

If the system works as depicted, Data Ninjas may have the solution to the federation challenge which many organizations face. Smarter content should deliver contentier intelligence or something along that line.

Stephen E Arnold, December 28, 2016

Creativity for Search Vendors

December 18, 2016

If you scan the marketing collateral from now defunct search giants like Convera, DR LINK, Fulcrum Technologies or similar extinct beasties, you will notice a similarity of features and functions. Let’s face it. Search and retrieval has been stuck in the mud for decades. Some wizards point to the revolution of voice search, emoji based queries, and smart software which knows what you want before you know you need some information.

Typing key words, indexing systems which add concept labels, and shouting at a mobile phone whilst standing between cars on a speeding train returns semi-useful links to what amount to homework: Open link, scan for needed info, close link, and do it again.


Eureka, California is easy to find. Get inspired.

Now there is a solution to search and content processing vendors’ inability to be creative. These methods appear to fuel the flights of fancy emanating from predictive analytics, Big Data, and semantic search companies.

Navigate to “8 Tried-and-Tested Ways to Unlock Your Creativity.” Now you too can emulate the breakthroughs, insights, and juxtapositions of Leonardo, Einstein, Mozart, and, of course, Facebook’s design team.

Let’s take a look at these eight ideas.

  1. Set up a moodboard. I have zero idea what a moodboard is. I am not sure it would fit into the work methods of Beethoven. He seemed a bit volatile and prone to “bad” moods.
  2. Talk it out. That’s a great idea for companies engaged in classified projects for nation states. Why not have those conversations in a coffee shop or better yet on an airplane with strangers sitting cheek by jowl.
  3. Brainstorming. My recollection of brainstorming is that it can be fun, but without one person who doesn’t get with the program, the “ideas” are often like recycled plastic bottles. Not always, of course. But the donuts can be a motivator.
  4. Mindmapping. Yep, diagrams. These are helpful, particularly when equations are included for the home economics majors and failed webmasters who wrangle a job at a search or content processing vendor. What’s that pitchfork looking thing mean?
  5. Doodling. Works great. The use of paper and pencils is popular. One can use a Microsoft Surface or a giant iPad thing. Profilers and psychologists enjoy doodles. Venture capitalists who invested in a search and content processing company often sketch somewhat dark images.
  6. Music. Forget that Mozart and fighter pilot stuff. Go for Gregorian chants, heavy metal, and mindfulness tunes. Here in Harrod’s Creek, we love Muzak featuring the Whites and John Lomax.
  7. Lucid dreaming. This idea is popular among some of the visionaries working at high profile Sillycon Valley companies. Loon balloons, solar powered Internet aircraft, and trips to Mars. Apply that thinking to search and what do you get? Tay, search by sketch, and smart maps which identify pizza joints.
  8. Imagine what a great innovator would do. That works. People sitting on a sofa playing a video game can innovate between button pushes.

Why aren’t search and content processing vendors more creative? Now these folks can go in new directions armed with these tips and the same eight or nine algorithms in wide use. Peak search? Not by a country mile.

Stephen E Arnold, December 18, 2016

Oh, Canada: Censorship Means If It Is Not Indexed, Information Does Not Exist

December 8, 2016

I read “Activists Back Google’s Appeal against Canadian Order to Censor Search Results.” The write up appears in a “real” journalistic endeavor, a newspaper in fact. (Note that newspapers are facing an ad revenue Armageddon if the information in “By 2020 More Money Will Be Spent on Online Ads Than on Radio or Newspapers” is accurate.)

The point of the “real” journalistic endeavor’s write up is to point out that censorship could get a bit of a turbo boost. I highlighted this passage:

In an appeal heard on Tuesday [December 6, 2016] in the supreme court of Canada, Google Inc took aim at a 2015 court decision that sought to censor search results beyond Canada’s borders.

If the appeal goes south, a government could instruct the Google and presumably any other indexing outfit to delete pointers to content. If one cannot find online information, that information may cease to be findable. Ergo. The information does not exist for one of the search savvy wizards holding a mobile phone or struggling to locate a US government document.

The “real” journalistic endeavor offers:

A court order to remove worldwide search results could threaten free expression if it catches on globally – where it would then be subject to wildly divergent standards on freedom of speech.

It is apparently okay for a “real” journalistic endeavor to prevent information from appearing in its information flows as long as the newspaper is doing the deciding. But when a third party like a mere government makes the decision, the omission is a very bad thing.

I don’t have a dog in this fight because I live in rural Kentucky, am an actual addled goose (honk!), and find that so many folks are now realizing the implications of indexing digital content. Let’s see. Online Web indexes have been around and free for 20, maybe 30 years.

There is nothing like the howls of an animal caught in a trap. The animal wandered into or was lured into the trap. Let’s howl.

Stephen E Arnold, December 8, 2016
