Searching Video and Audio Files is Now Easier Than Ever

February 7, 2018

While text-based search has been honed to near perfection in recent years, video and audio search still lags behind. However, a few companies are beginning to chip away at this problem. One that recently caught our attention is vidDistill, a company that distills YouTube videos into an indexed list of key sentences drawn from their captions.

According to their website:

vidDistill first gets the video and captions from YouTube based off of the URL the user enters. The caption text is annotated with the time in the video the text corresponds to. If manually provided captions are available, vidDistill uses those captions. If manually provided captions are not available, vidDistill tries to fall back on automatically generated captions. If no captioning of any sort is available, then vidDistill will not work.

 

Once vidDistill has the punctuated text, it uses a text summarization algorithm to identify the most important sentences of the entire transcript of the video. The text summarization algorithm compresses the text as much as the user specifies.
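The pipeline the company describes is simple enough to sketch. The following is a rough, hypothetical illustration in Python, assuming the third-party youtube_transcript_api and sumy packages; it is not vidDistill’s actual code, and the video ID is a placeholder.

```python
# Sketch of the caption-then-summarize idea described above; not vidDistill's code.
from youtube_transcript_api import YouTubeTranscriptApi
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer


def summarize_video(video_id: str, sentence_count: int = 10):
    # Fetch caption segments; each entry carries its text and a start time.
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    text = " ".join(seg["text"] for seg in segments)

    # Compress the transcript to the number of sentences the user specifies.
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summary = LsaSummarizer()(parser.document, sentence_count)
    return [str(sentence) for sentence in summary]


if __name__ == "__main__":
    for line in summarize_video("dQw4w9WgXcQ", sentence_count=5):  # placeholder video ID
        print(line)
```

A real implementation would also have to check for manually provided captions before falling back to automatically generated ones, as the company notes.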

The service is interesting and does what the company claims. However, we wish users could search for specific words and have the matches surface in the index so they could skip directly to specific parts of a video. This has already been done quite well in audio. A service called Happy Scribe, which is aimed at journalists transcribing audio notes, takes an audio file and (for a small fee) transcribes it to text, which can then be searched. It’s pretty elegant and fairly accurate, depending on the audio quality. We could see vidDistill using the same approach to great success.

Patrick Roland, February 7, 2018

DarkCyber for January 30, 2018, Now Available

January 30, 2018

DarkCyber for January 30, 2018, is now available at www.arnoldit.com/wordpress and on Vimeo at https://vimeo.com/253109084.

This week’s program looks at the 4iq discovery of more than one billion user names and passwords. The collection ups the ante for stolen data. The Dark Web database contains a search system and a “how to” manual for bad actors. 4iq, a cyber intelligence specialist, used its next-generation system to locate and analyze the database.

Stephen E Arnold said:

“The technology powering 4iq combines sophisticated data acquisition with intelligent analytics. What makes 4iq’s approach interesting is that the company integrates trained intelligence analysts in its next-generation approach. The discovery of the user credentials underscores the importance of 4iq’s method and the rapidly rising stakes in online access.”

DarkCyber discusses “reputation scores” for Dark Web contraband sites. The systems emulate the functionality of Amazon and eBay-style vendor report cards.

Researchers in Germany have demonstrated one way to compromise WhatsApp secure group chat sessions. With chat and alternative communication channels becoming more useful to bad actors than Dark Web forums and Web sites, law enforcement and intelligence professionals seek ways to gather evidence.

DarkCyber points to a series of Dark Web reviews. The sites can be difficult to locate using Dark Web search systems and postings on paste sites. One of the identified Dark Web sites makes use of a hosting service in Ukraine.

About DarkCyber

DarkCyber is one of the few video news programs that presents information about the Dark Web and lesser-known Internet services. The information in the program comes from research conducted for the second edition of “Dark Web Notebook” and from information published in Beyond Search, a free Web log focused on search and online services. The blog is now in its 10th year of publication, and the backfile consists of more than 15,000 stories.

 

Kenny Toth, January 30, 2018

Transcribing Podcasts with Help from Amazon

January 19, 2018

I enjoy walking the dog and listening to podcasts. However, I read more quickly than I listen. Speeding up playback is a feature that works well for those in their mid-20s. At age 74, not so much.

Few podcasts create transcripts. Kudos to Steve Gibson at Security Now. He pays for this work himself because other podcasts on the TWiT network don’t offer much in the way of transcripts. And in the case of This Week in Law, there aren’t weekly programs. Recently, no programs. Helpful, no?

You can get the basics of the transcriptions produced by Amazon Transcribe in “Podcast Transcription with Amazon Transcribe.”

One has to be a programmer to use the service. Here’s the passage in the write up I highlighted:

The first thing that I would want out of this is speaker detection, i.e. knowing how many different speakers there are and to be able to differentiate their voices. Podcasts typically have more than one host, or a host and a guest for an interview, so that would be helpful. Also, it would be great to be able to send back corrections on words somehow, to help with the training. I’m sure Amazon has a pretty good thing going, but maybe on an account level? Or for proper nouns? I still think it would be good for people to provide that feedback.
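To give a sense of what “being a programmer” involves here, the snippet below is a minimal sketch of starting a transcription job with boto3, Amazon’s Python SDK. The bucket, file key, and job name are placeholders I made up, not values from the write up, and speaker labels are switched on because that is the feature the author wants.

```python
# Minimal sketch of an Amazon Transcribe job via boto3; bucket, key, and job name
# are placeholders, and AWS credentials must already be configured.
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.start_transcription_job(
    TranscriptionJobName="podcast-episode-042",  # hypothetical job name
    Media={"MediaFileUri": "s3://my-podcast-bucket/episode-042.mp3"},  # placeholder URI
    MediaFormat="mp3",
    LanguageCode="en-US",
    Settings={
        # Speaker detection, which the author of the write up wished for.
        "ShowSpeakerLabels": True,
        "MaxSpeakerLabels": 2,
    },
)

# Poll for completion; the finished job exposes a JSON transcript with timings.
status = transcribe.get_transcription_job(TranscriptionJobName="podcast-episode-042")
print(status["TranscriptionJob"]["TranscriptionJobStatus"])
```

The resulting JSON includes word-level timings, which is what would make searching within an episode practical.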

Perhaps the podcast transcript void can be filled—at long last.

Stephen E Arnold, January 19, 2018

Google Search: More Digital Gutenberg Action

December 24, 2017

Years ago I wrote “Google: The Digital Gutenberg.” The point of the monograph was to call attention to the sheer volume of content which Google generates. Few people outside of my circle of clients who paid for the analysis took much notice.

I spotted this article in my stream of online content. “Google Search Updates Take a Personalized Turn” explains that a Google search for oneself – what some folks call an egosearch – returns a list of results with a bubblegum card about the person. (A bubblegum card is intel jargon for a short snapshot of a person of interest.)

The publishing angle – hence the connection to Gutenberg – is that the write up reports the person who does an egosearch can update the information about oneself.

A number of interesting angles sparkle from this gem of converting search into something more “personal.” What’s interesting is that the functionality reaches back to the illustration of a bubblegum card about Michael Jackson which appears in US20070198481. Here’s an annotated patent document snippet from one of my for-fee Google lectures which I was giving in the 2006 to 2009 time period:

[Image: annotated snippet from US20070198481 showing a bubblegum card for Michael Jackson]

Some information professionals will recognize this as an automated bubblegum card complete with aliases, personal details, last known location, etc. If you have money to spend, there are a number of observations my research team formulated about this “personalization” capability.

I liked this phrase in the Scalzi write up: “pretty deep into the Google ecosystem.” Nope, there is much more within the Google content parsing and fusion system. Lots, lots more, as described in “Automatic Object Reference Identification and Linking in a Browseable Fact Repository.”

Keep in mind that this is just one output from the digital Gutenberg which sells ads, delivers free online search to you and me, and tries to solve death and other interesting genetic issues.

Stephen E Arnold, December 24, 2017

Progress: From Selling NLP to Providing NLP Services

December 11, 2017

Years ago, Progress Software owned an NLP system. I recall conversations with natural language processing wizards from EasyAsk. Larry Harris developed a natural language system in 1999 or 2000. Progress purchased EasyAsk in 2005, if memory serves. I interviewed Craig Bassin in 2010 as part of my Search Wizards Speak series.

The recollection I have is that Progress divested itself of EasyAsk in order to focus on enterprise applications other than NLP. No big deal. Software companies are bought and sold every day.

However, what makes this recollection interesting to me is the information in “Beyond NLP: 8 Challenges to Building a Chatbot.” Progress went from a software company that owned an NLP system to a company that advises people like me about how challenging a chatbot system can be to build and make work. (I noted that the Wikipedia entry for Progress does not mention the EasyAsk acquisition and subsequent de-acquisition. Either small potatoes or a milestone best jumped over, I assume.)

Presumably it is easier to advise and get paid to implement than to fund and refine an NLP system like EasyAsk. If you are not familiar with EasyAsk, the company positions itself in eCommerce site search with its “cognitive eCommerce” technology. EasyAsk’s capabilities include voice-enabled natural language mobile search. This strikes me as a capability similar to that of a chatbot as I understand the concept.

History is history, one of my high school teachers once observed. Let’s move on.

What are the eight challenges to standing up a chatbot which sort of works? Here they are:

  1. The chat interface
  2. NLP
  3. The “context” of the bot
  4. Loops, splits, and recursions
  5. Integration with legacy systems
  6. Analytics
  7. Handoffs
  8. Character, tone, and persona.

As I review this list, I note that I have to decide whether to talk to a chatbot or type into a box so a “customer care representative” can assist me. The “representative” is, one assumes, a smart software robot.

I also notice that the bot has to have context. Think of a car dealer and the potential customer. The bot has to know that I want to buy a car. Seems obvious. But okay.

“Loops, splits, and recursions.” Frankly, I have no idea what this means. I know that chatbot-centric companies use jargon. I assume that this means “programming” so the NLP system returns a semi-on-point answer.
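Here is how I read the phrase, expressed as a toy Python sketch built around the car dealer example above: the “split” is the branch on what the customer asks for, and the “loop” is the re-prompt when the bot does not understand. It is my guess at the meaning, not anything taken from the Progress write up.

```python
# Toy illustration of "loops and splits" in a chatbot flow: branch on the request
# (split) and re-prompt until something is recognized (loop).
def car_dealer_bot() -> None:
    while True:  # loop: keep asking until we can route the request
        utterance = input("Bot: Are you looking to buy, lease, or get service? ").lower()
        if "buy" in utterance:
            print("Bot: Great, which model interests you?")        # split: purchase path
            break
        if "lease" in utterance:
            print("Bot: I can pull up current lease offers.")       # split: leasing path
            break
        if "service" in utterance:
            print("Bot: Let's schedule a service appointment.")     # split: service path
            break
        print("Bot: Sorry, I did not catch that.")  # fall through and loop again


if __name__ == "__main__":
    car_dealer_bot()
```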

Integration with legacy systems and handoffs seem to be similar to me. I would just call these two steps “integration” and be done with it.

The “character, tone, and persona” item seems to apply to how the chatbot sounds; for example, the nasty, imperious tone of a Kroger automated checkout system.

Net net: Progress is in the business of selling advisory and engineering services. The reason, in my opinion, was that Progress could not crack the code to make search and retrieval generate expected payoffs. Like some Convera executives, selling search related services was a more attractive path.

Stephen E Arnold, December 11, 2017

Google Search and Hot News: Sensitivity and Relevance

November 10, 2017

I read “Google Is Surfacing Texas Shooter Misinformation in Search Results — Thanks Also to Twitter.” What struck me about the article was the headline; specifically, the implication for me was that Google was not merely responding to user queries. Google is actively “surfacing,” or fetching and displaying, information about the event. Twitter is also involved. I don’t think of Twitter as much more than a party line. One can look up keywords or see a stream of content containing a keyword or, to use Twitter speak, a “hash tag.”

The write up explains:

Users of Google’s search engine who conduct internet searches for queries such as “who is Devin Patrick Kelley?” — or just do a simple search for his name — can be exposed to tweets claiming the shooter was a Muslim convert; or a member of Antifa; or a Democrat supporter…

I think I understand. A user inputs a term and Google’s system matches the user’s query to the content in the Google index. Google maintains many indexes, despite its assertion that it is a “universal search engine.” One has to search across different Google services and their indexes to build up a mosaic of what Google has indexed about a topic; for example, blogs, news, the general index, maps, finance, etc.

Developing a composite view of what Google has indexed takes time and patience. The results may vary depending on whether the user is logged in, searching from a particular geographic location, or has enabled or disabled certain behind the scenes functions for the Google system.

The write up contains this statement:

Safe to say, the algorithmic architecture that underpins so much of the content internet users are exposed to via tech giants’ mega platforms continues to enable lies to run far faster than truth online by favoring flaming nonsense (and/or flagrant calumny) over more robustly sourced information.

From my point of view, the ability to figure out what influences Google’s search results requires significant effort, numerous test queries, and recognition that Google search now balances on two pogo sticks. One “pogo stick” is blunt-force keyword search. When content is indexed, terms are plucked from source documents. The system may or may not assign additional index terms to the document; for example, geographic or time stamps.

The other “pogo stick” is discovery and assignment of metadata. I have explained some of the optional tags which Google may or may not include when processing a content object; for example, see the work of Dr. Alon Halevy and Dr. Ramanathan Guha.
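To make the two “pogo sticks” concrete, here is a toy sketch of my own. It illustrates blunt-force keyword indexing plus optional metadata assigned at processing time; it is an illustration only, not a description of Google’s actual system.

```python
# Toy illustration of keyword indexing (pogo stick one) plus assigned metadata
# (pogo stick two). Documents and metadata values are made up.
from collections import defaultdict
from typing import Optional

documents = {
    "doc1": {"text": "texas shooter identified", "meta": {"date": "2017-11-05", "geo": "TX"}},
    "doc2": {"text": "shooter motive unknown", "meta": {"date": "2017-11-06"}},
}

# Pogo stick one: terms are plucked from the source documents.
index = defaultdict(set)
for doc_id, doc in documents.items():
    for term in doc["text"].split():
        index[term].add(doc_id)

# Pogo stick two: metadata assigned at processing time can narrow or re-rank hits.
def search(term: str, date: Optional[str] = None):
    hits = index.get(term, set())
    if date is not None:
        hits = {d for d in hits if documents[d]["meta"].get("date") == date}
    return sorted(hits)

print(search("shooter"))                     # ['doc1', 'doc2']
print(search("shooter", date="2017-11-05"))  # ['doc1']
```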

But Google, like other smart content processing systems today, has a certain sensitivity. This means that the streams of content being processed may contain certain keywords which act as triggers.

When “news” takes place, the flood of content allows smart indexing systems to identify a “hot topic.” The test queries we ran for my monographs “The Google Legacy” and “Google Version 2.0” suggest that Google is sensitive to certain “triggers” in content. Feedback can be useful; it can also cause smart software to wobble a bit.

[Image: a T-shirt reading “The impossible takes a little longer”]

T shirts are easy; search is hard.

I believe that the challenge Google faces is similar to the problem Bing and Yandex are exploring as well; that is, certain numerical recipes can overreact to certain inputs. These overreactions may increase the difficulty of determining which content object is “correct,” “factual,” or “verifiable.”

Expecting a free search system, regardless of its owner, to know what’s true and what’s false is understandable. In my opinion, making this type of determination with today’s technology, system limitations, and content analysis methods is impossible.

In short, the burden of figuring out what’s right and what’s not correct falls on the user, not exclusively on the search engine. Users, on the other hand, may not want the “objective” reality. Search vendors want traffic and want to generate revenue. Algorithms want nothing.

Mix these three elements and one takes a step closer to understanding that search and retrieval is not the slam dunk some folks would have me believe. In fact, the sensitivity of content processing systems to comparatively small inputs requires more discussion. Perhaps that type of information will come out of discussions about how best to deal with fake news and related topics in the context of today’s information retrieval environment.

Free search? Think about that too.

Stephen E Arnold, November 10, 2017

Understanding Intention: Fluffy and Frothy with a Few Factoids Folded In

October 16, 2017

Introduction

One of my colleagues forwarded me a document called “Understanding Intention: Using Content, Context, and the Crowd to Build Better Search Applications.” To get a copy of the collateral, one has to register at this link. My colleague wanted to know what I thought about this “book” by Lucidworks. That’s what Lucidworks calls the 25-page marketing brochure. I read the PDF file and was surprised at what I perceived as fluff, not facts or a cohesive argument.


The topic was of interest to my colleague because we completed a five-month review and analysis of “intent” technology. In addition to two white papers about using smart software to figure out and tag (index) content, we had to immerse ourselves in computational linguistics, multi-language content processing technology, and semantic methods for “making sense” of text.

The Lucidworks’ document purported to explain intent in terms of content, context, and the crowd. The company explains:

With the challenges of scaling and storage ticked off the to-do list, what’s next for search in the enterprise? This ebook looks at the holy trinity of content, context, and crowd and how these three ingredients can drive a personalized, highly-relevant search experience for every user.

The presentation of “intent” was quite different from what I expected. The details of figuring out what content “means” were sparse. The focus was not on methodology but on selling integration services. I found this interesting because I have Lucidworks in my list of open source search vendors. These are companies which repackage open source technology, create some proprietary software, and assist organizations with engineering and integrating services.

The book was an explanation anchored in buzzwords, not the type of detail we expected. After reading the text, I was not sure how Lucidworks would go about figuring out what an utterance might mean. The intent-centric systems we reviewed over the course of five months followed several different paths.

Some companies relied upon statistical procedures. Others used dictionaries and pattern matching. A few combined multiple approaches in a content pipeline. Our client, a firm based in Madrid, focused on computational linguistics plus a series of procedures which combined proprietary methods with “modules” to perform specific functions. The idea behind this approach was to lift intent identification accuracy from the 65 to 80 percent range to approaching, and often exceeding, 90 percent. For text processing in multi-language corpuses, the Spanish company’s approach was a breakthrough.
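The contrast between these approaches is easier to see in a few lines of code. The sketch below pairs a dictionary and pattern-matching pass with a crude statistical fallback; it is my own illustration, not any vendor’s pipeline, and the patterns and keyword weights are invented.

```python
# Toy contrast of two intent-identification approaches: dictionary/pattern matching
# first, then a crude keyword-count fallback. Patterns and weights are invented.
import re
from collections import Counter

INTENT_PATTERNS = {
    "purchase": [r"\bbuy\b", r"\border\b", r"\bpurchase\b"],
    "refund": [r"\brefund\b", r"\breturn\b", r"\bmoney back\b"],
}

TRAINING_KEYWORDS = {
    "purchase": Counter({"want": 3, "new": 2, "price": 2}),
    "refund": Counter({"broken": 3, "disappointed": 2, "back": 2}),
}


def classify_intent(utterance: str) -> str:
    text = utterance.lower()
    # Pass 1: dictionary / pattern matching.
    for intent, patterns in INTENT_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return intent
    # Pass 2: crude statistical scoring over keyword counts.
    scores = {
        intent: sum(counts[token] for token in text.split())
        for intent, counts in TRAINING_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(classify_intent("I want my money back"))                 # refund
print(classify_intent("what is the price of the new model"))   # purchase
```

Real systems layer far more on top of this, which is exactly the detail the Lucidworks document leaves out.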

I was disappointed but not surprised that Lucidworks’ approach was breezy. One of my colleagues used the word “frothy” to describe the information in the “Understanding Intention” document.

As I read the document, which struck me as a shotgun marriage of generalizations and examples of use cases in which “intent” was important, I made some notes.

Let me highlight five of the observations I made. I urge you to read the original Lucidworks’ document so you can judge the Lucidworks’ arguments for yourself.

Imitation without Attribution

My first reaction was that Lucidworks had borrowed conceptually from ideas articulated by Dr. Gregory Grefenstette in his book Search Based Applications: At the Confluence of Search and Database Technologies. You can purchase this 2011 book on Amazon at this link. Lucidworks’ approach borrowed some of the analysis but, unlike Dr. Grefenstette’s, did not include the detail which supports the increasing importance of using search as a utility within larger information access solutions. Without that detail, the Lucidworks’ document struck me as a description of the type of solutions that a company like Tibco is now offering its customers.


AI Predictions for 2018

October 11, 2017

AI just keeps gaining steam, and is positioned to be extremely influential in the year to come. KnowStartup describes “10 Artificial Intelligence (AI) Technologies that Will Rule 2018.” Writer Biplab Ghosh introduces the list:

Artificial Intelligence is changing the way we think of technology. It is radically changing the various aspects of our daily life. Companies are now significantly making investments in AI to boost their future businesses. According to a Narrative Science report, just 38% percent of the companies surveys used artificial intelligence in 2016—but by 2018, this percentage will increase to 62%. Another study performed by Forrester Research predicted an increase of 300% in investment in AI this year (2017), compared to last year. IDC estimated that the AI market will grow from $8 billion in 2016 to more than $47 billion in 2020. ‘Artificial Intelligence’ today includes a variety of technologies and tools, some time-tested, others relatively new.

We are not surprised that the top three entries are natural language generation, speech recognition, and machine learning platforms, in that order. Next are virtual agents (aka “chatbots” or “bots”), then decision management systems, AI-optimized hardware, deep learning platforms, robotic process automation, text analytics & natural language processing, and biometrics. See the write-up for details on each of these topics, including some top vendors in each space.

Cynthia Murrell, October 11, 2017

New Beyond Search Overflight Report: The Bitext Conversational Chatbot Service

September 25, 2017

Stephen E Arnold and the team at Arnold Information Technology analyzed Bitext’s Conversational Chatbot Service. The BCBS taps Bitext’s proprietary Deep Linguistic Analysis Platform to provide greater accuracy for chatbots regardless of platform.

Arnold said:

The BCBS augments chatbot platforms from Amazon, Facebook, Google, Microsoft, and IBM, among others. The system uses specific DLAP operations to understand conversational queries. Syntactic functions, semantic roles, and knowledge graph tags increase the accuracy of chatbot intent and slotting operations.

One unique engineering feature of the BCBS is that specific Bitext content processing functions can be activated to meet specific chatbot applications and use cases. DLAP supports more than 50 languages. A BCBS licensee can activate additional language support as needed. A chatbot may be designed to handle English language queries, but Spanish, Italian, and other languages can be activated via an instruction.

Dr. Antonio Valderrabanos said:

People want devices that understand what they say and intend. BCBS (Bitext Chatbot Service) allows smart software to take the intended action. BCBS allows a chatbot to understand context and leverage deep learning, machine intelligence, and other technologies to turbo-charge chatbot platforms.

Based on ArnoldIT’s test of the BCBS, tagging accuracy jumped by as much as 70 percent. Another surprising finding was that the time required to perform content tagging decreased.

Paul Korzeniowski, a member of the ArnoldIT study team, observed:

The Bitext system handles a number of difficult content processing issues easily. Specifically, the BCBS can identify negation regardless of the structure of the user’s query. The system can understand double intent; that is, a statement which contains two or more intents. BCBS is one of the most effective content processing systems to deal correctly  with variability in human statements, instructions, and queries.
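To make “double intent” and negation handling concrete, here is a hypothetical example of the kind of structured output such a system might emit for a single utterance. The field names and values are my own invention for illustration; they are not Bitext’s actual schema or API.

```python
# Hypothetical analysis of "Cancel my hotel reservation but do not touch the flight."
# Field names are invented for illustration; this is not Bitext's actual output format.
analysis = {
    "utterance": "Cancel my hotel reservation but do not touch the flight.",
    "intents": [
        {   # first intent in the sentence
            "intent": "cancel_booking",
            "slots": {"booking_type": "hotel"},
            "negated": False,
        },
        {   # second intent in the same sentence ("double intent")
            "intent": "modify_booking",
            "slots": {"booking_type": "flight"},
            "negated": True,  # "do not touch" flips the action
        },
    ],
}

for item in analysis["intents"]:
    action = "skip" if item["negated"] else "execute"
    print(f'{item["intent"]} -> {action} ({item["slots"]})')
```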

Bitext’s BCBS and DLAP solutions deliver higher accuracy, enable more reliable sentiment analyses, and even output critical actor-action-outcome data. Such data are invaluable for disambiguation in Web and enterprise search applications, content processing for discovery solutions used in fraud detection and law enforcement, and consumer-facing mobile applications.

Because Bitext was one of the first platform solution providers, the firm was able to identify market trends and create its unique BCBS service for major chatbot platforms. The company focuses solely on solving problems common to companies relying on machine learning and, as a result, has done a better job delivering such functionality than other firms have.

A copy of the 22-page Beyond Search Overflight analysis is available directly from Bitext at this link on the Bitext site.

Once again, Bitext has broken through the barriers that block multi-language text analysis. The company’s Deep Linguistics Analysis Platform supports more than 50 languages at a lexical level and more than 20 at a syntactic level, and it makes the company’s technology available for a wide range of applications in Big Data, Artificial Intelligence, social media analysis, text analytics, and the new wave of products designed for voice interfaces supporting multiple languages, such as chatbots. Bitext’s breakthrough technology solves many complex language problems and integrates machine learning engines with linguistic features. Bitext’s Deep Linguistics Analysis Platform allows seamless integration with commercial, off-the-shelf content processing and text analytics systems. Bitext’s innovative system reduces the cost of processing multilingual text for government agencies and commercial enterprises worldwide. The company has offices in Madrid, Spain, and San Francisco, California. For more information, visit www.bitext.com.

Kenny Toth, September 25, 2017

Lucidworks: The Future of Search Which Has Already Arrived

August 24, 2017

I am pushing 74, but I am interested in the future of search. The reason is that with each passing day I find it more and more difficult to locate the information I need in the course of my routine research for my books and other work. I was anticipating a juicy read when I requested a copy of “Enterprise Search in 2025.” The “book” is a nine-page PDF. After two years of effort and much research, my team and I were able to squeeze the basics of Dark Web investigative techniques into about 200 pages. I assumed that a nine-page book would deliver a high-impact payload comparable to one of the chapters in one of my books like CyberOSINT or Dark Web Notebook.

I was surprised that a nine-page document was described as a “book.” I was quite surprised by the Lucidworks’ description of the future. For me, Lucidworks is describing information access already available to me and most companies from established vendors.

The book’s main idea in my opinion is as understandable as this unlabeled, data-free graphic which introduces the text content assembled by Lucidworks.

[Image: the unlabeled, data-free graphic from the Lucidworks document]

However, the pamphlet’s text does not make this diagram understandable to me. I noted these points as I worked through the basic argument that client-server search is on the downturn. Okay. I think I understand, but the assertion “Solr killed the client-server stars” was interesting. I read this statement and highlighted it:

Other solutions developed, but the Solr ecosystem became the unmatched winner of the search market. Search 1.0 was over and Solr won.

In the world of open source search, Lucene and Solr have gained adherents. Based on the information my team gathered when we were working on an IDC open source search project, the dominant open source search library was Lucene. If our data were accurate when we did the research, Elastic’s Elasticsearch, which is built on Lucene, had emerged as the go-to open source search system. Alternatives like Solr and Flaxsearch have their users and supporters, but Elastic, founded by Shay Banon, was a definite step up from his earlier search offering called Compass.

In the span of two and a half years, Elastic had garnered more than $100 million in funding by 2014 and expanded into a number of adjacent information access market sectors. Reports I have received from those attending Elastic meetings were that Elastic was putting considerable pressure on proprietary search systems and a bit of a squeeze on Lucidworks. Google’s withdrawal of its odd duck Google Search Appliance may have been, in small part, due to the rise of Elasticsearch and the changes made by organizations trying to figure out how to make sense of the digital information to which their staff had access.

But enough about the Lucene-Solr and open source versus proprietary search yin and yang tension.

