Amusing Mistake Illustrates Machine Translation Limits

May 12, 2016

Machine translation is not quite perfect yet, but we’ve been assured that it will be someday. That’s the upshot of Business Insider’s piece, “This Microsoft Exec’s Hilarious Presentation Fail Shows Why Computer Translation is so Difficult.” Writer Matt Weinberger relates an anecdote shared by Microsoft research head Peter Lee. The misstep occurred during a 2015 presentation, for which Lee set up Skype Translator to translate his words over the speakers into Mandarin as he went. Weinberger writes:

“Part of Lee’s speech involved a personal story of growing up in a ‘snowy town’ in upper Michigan. He noticed that most of the crowd was enraptured — except for a few native Chinese speakers in the crowd who couldn’t stop giggling. After the presentation, Lee says he asked one of those Chinese speakers the reason for the laughter. It turns out that ‘snowy town’ translates into ‘Snow White’s Town.’ Which seems innocent enough, except that it turns out that ‘Snow White’s town’ is actually Chinese slang for ‘a town where a prostitute lives,’ Lee says. Whoops.

“Lee says it wasn’t caught in the profanity filters because there weren’t actually any bad words in the phrase. But it’s the kind of regional flavor where a direct translation of the words can’t bring across the meaning.”

Whoops indeed. The article notes that another problem with Skype Translator is its penchant for completely disregarding non-word utterances, like “um” and “ahh,” that often carry necessary meaning. We’re reminded, though, that these and other problems are expected to be ironed out within the next few years, according to Microsoft Research chief scientist Xuedong Huang. I wonder how many more amusing anecdotes will arise in the meantime.

Cynthia Murrell, May 12, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under Data, Microsoft, News, Security, Technology | Comments Off on Amusing Mistake Illustrates Machine Translation Limits

Billions in vc Funding Continues Rinse and Repeat Process

May 12, 2016

In the tech world, the word billion may be losing meaning for some. Pando published a recent editorial called, While the rest of tech struggles, so far VCs have raised more this quarter than in past three years. This piece calls attention to the seemingly never-ending list of VC firms raising ever-more funds. For example, Accel announced their funds were at $2 billion, Founders Fund raised $1 billion in new funds, and Andreessen Horowitz currently works to achieve another $1.5 billion. The author writes,

“It was hard to put that [recent fundraising rounds] in context. I mean, yeah. These are major funds. Is it news that they raised a collective $4.5 billion more at some point? Doesn’t mean they’ll invest it any more quickly. All it means is that the two will still be around for another ten years, which we kinda already guessed. It’s staggeringly hard for a venture fund to actually go out of business, even when it wasn’t some of the first money in Facebook or, in the case of Marc Andreessen, sits on its board. [Disclosure: Marc Andreessen, Founders Fund and Accel are all investors in Pando.]”

As the author wonders, asking Pitchbook if it’s a “bigger quarter than usual”, our eyebrows are not raised by this this thought, nor easy money, bubbles, unicorns. Nah, this is just routine in Sillycon Valley.

Megan Feil, May 12, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under Data, Facebook, News, Social Media, Technology | Comments Off on Billions in vc Funding Continues Rinse and Repeat Process

DARPA Seeks Keys to Peace with High-Tech Social Science Research

May 11, 2016

Strife has plagued the human race since the beginning, but the Pentagon’s research arm thinks may be able to get to the root of the problem. Defense Systems informs us, “DARPA Looks to Tap Social Media, Big Data to Probe the Causes of Social Unrest.” Writer George Leopold explains:

“The Defense Advanced Research Projects Agency (DARPA) announced this week it is launching a social science research effort designed to probe what unifies individuals and what causes communities to break down into ‘a chaotic mix of disconnected individuals.’ The Next Generation Social Science (NGS2) program will seek to harness steadily advancing digital connections and emerging social and data science tools to identify ‘the primary drivers of social cooperation, instability and resilience.’

“Adam Russell, DARPA’s NGS2 program manager, said the effort also would address current research limitations such as the technical and logistical hurdles faced when studying large populations and ever-larger datasets. The project seeks to build on the ability to link thousands of diverse volunteers online in order to tackle social science problems with implications for U.S. national and economic security.”

The initiative aims to blend social science research with the hard sciences, including computer and data science. Virtual reality, Web-based gaming, and other large platforms will come into play. Researchers hope their findings will make it easier to study large and diverse populations. Funds from NGS2 will be used for the project, with emphases on predictive modeling, experimental structures, and boosting interpretation and reproducibility of results.

Will it be the Pentagon that finally finds the secret to world peace?

Cynthia Murrell, May 11, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under AI, Big data, Data, News, Privacy, Security, Technology | Comments Off on DARPA Seeks Keys to Peace with High-Tech Social Science Research

Update from Lucene

May 10, 2016

It has been awhile since we heard about our old friend Apache Lucene, but the open source search engine has something new, says Open Source Connections in the article, “BM25 The Next Generation Of Lucene Relevance.” Lucene is added BM25 to its search software and it just might improve search results.

“BM25 improves upon TF*IDF. BM25 stands for “Best Match 25”. Released in 1994, it’s the 25th iteration of tweaking the relevance computation. BM25 has its roots in probabilistic information retrieval. Probabilistic information retrieval is a fascinating field unto itself. Basically, it casts relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant.”

Apache Lucene formerly relied on TF*IDF, a way to rank how users value a text match relevance. It relied on two factors: term frequency-how often a term appeared in a document and inverse document frequency aka idf-how many documents the term appears and determines how “special” it is. BM25 improves on the old TF*IDF, because it gives negative scores for terms that have high document frequency. IDF in BM25 solves this problem by adding a 1 value, therefore making it impossible to deliver a negative value.

BM25 will have a big impact on Solr and Elasticsearch, not only improving search results and accuracy with term frequency saturation.

Whitney Grace, May 10, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under algorithms, Data, News, Search, Technology | 1 Comment

New Criminal Landscape Calls for New Approaches

May 9, 2016

The Oxford University Press’s blog discusses law enforcement’s interest in the shady side of the Internet in its post, “Infiltrating the Dark Web.” Writer Andrew Staniforth observes that the growth of crime on the Dark Web calls for new tactics. He writes:

“Criminals conducting online abuses, thefts, frauds, and terrorism have already shown their capacity to defeat Information Communication Technology (ICT) security measures, as well as displaying an indifference to national or international laws designed to stop them. The uncomfortable truth is that as long as online criminal activities remain profitable, the miscreants will continue, and as long as technology advances, the plotters and conspirators who frequent the Dark Web will continue to evolve at a pace beyond the reach of traditional law enforcement methods.

“There is, however, some glimmer of light amongst the dark projection of cybercrime as a new generation of cyber-cops are fighting back. Nowhere is this more apparent than the newly created Joint Cybercrime Action Taskforce (J-CAT) within Europol, who now provide a dynamic response to strengthen the fight against cybercrime within the European Union and beyond Member States borders. J-CAT seeks to stimulate and facilitate the joint identification, prioritisation, and initiation of cross-border investigations against key cybercrime threats and targets – fulfilling its mission to pro-actively drive intelligence-led actions against those online users with criminal intentions.”

The article holds up J-CAT as a model for fighting cybercrime. It also emphasizes the importance of allocating resources for gathering intelligence, and notes that agencies are increasingly focused on solutions that can operate in mobile and cloud environments. Increased collaboration, however, may make the biggest difference in the fight against criminals operating on the Dark Web.

Cynthia Murrell, April 9, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under Dark Web, News, Privacy, Security, Technology | Comments Off on New Criminal Landscape Calls for New Approaches

Wikipedia Relies on Crowdsourcing Once More

May 9, 2016

As a non-profit organization, the Wikimedia Foundation relies on charitable donations to fund many of its projects, including Wikipedia. It is why every few months, when you are browsing the Wiki pages you will see a donation bar pop to send them money. Wikimedia uses the funds to keep the online encyclopedia running, but also to start new projects. Engadget reports that Wikipedia is interested in taking natural language processing and applying it to the Wikipedia search engine, “Wikipedia Is Developing A Crowdsourced Speech Engine.”

Working with Sweden’s KTH Royal Institute of Technology, Wikimedia researchers are building a speech engine to enable people with reading or visual impairments to access the plethora of information housed in the encyclopedia. In order to fund the speech engine, the researchers turned to crowdsourcing. It is estimated that twenty-five percent, 125 million monthly users, will benefit from the speech engine.

” ‘Initially, our focus will be on the Swedish language, where we will make use of our own language resources,’ KTH speech technology professor Joakim Gustafson, said in a statement. ‘Then we will do a basic English voice, which we expect to be quite good, given the large amount of open source linguistic resources. And finally, we will do a rudimentary Arabic voice that will be more a proof of concept.’”

Wikimedia wants to have a speech engine in Arabic, English, and Swedish by the end of 2016, then they will focus on the other 280 languages they support with their projects. Usually, you have to pay to have an accurate and decent natural language processing machine, but if Wikimedia develops a decent speech engine it might not be much longer before speech commands are more commonplace.

Whitney Grace, May 9, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under AI, Content processing, Data, News, Search, Security, Technology | Comments Off on Wikipedia Relies on Crowdsourcing Once More

European Cybersecurity Companies

May 8, 2016

We’ve run across an interesting list of companies at Let’s Talk Payments, “Europe’s Elite Cybersecurity Club.” The bare-bones roster names and links to 28 cybersecurity companies, with a brief description of each. See the original for the descriptions, but here are their entries:

SpamTitan, Gemalto, Avira, itWatch, BT, Sophos, DFLabs, ImmuniWeb, Silent Circle, Deep-Secure, SentryBay , AVG Technologies, Clearswift, ESNC, DriveLock, BitDefender, neXus, Thales, Cryptovision, Secunia, Osirium, Qosmos, Digital Shadows, F-Secure, Smoothwall, Brainloop, TrulyProtect, and Enorasys Security Analytics

It is a fine list as far as it goes, but we notice it is not exactly complete. For example, where is FinFisher’s parent company, Gamma International? Still, the list is a concise and valuable source for anyone interested in learning more about these companies.

Cynthia Murrell, May 8, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under Data, Legal matters, News, Privacy, Security, Technology | Comments Off on European Cybersecurity Companies

How Hackers Hire

May 7, 2016

Ever wonder how hackers fill job openings, search-related or otherwise? A discussion at the forum tehPARADOX.COM considers, “How Hackers Recruit New Talent.” Poster MorningLightMountain cites a recent study by cybersecurity firm Digital Shadows, which reportedly examined around 100 million websites, both on the surface web and on the dark web, for recruiting practices. We learn:

“The researchers found that the process hackers use to recruit new hires mirrors the one most job-seekers are used to. (The interview, for example, isn’t gone—it just might involve some anonymizing technology.) Just like in any other industry, hackers looking for fresh talent start by exploring their network, says Rick Holland, the vice president of strategy at Digital Shadows. ‘Reputation is really, really key,’ Holland says, so a candidate who comes highly recommended from a trusted peer is off to a great start. When hiring criminals, reputation isn’t just about who gets the job done best: There’s an omnipresent danger that the particularly eager candidate on the other end of the line is actually an undercover FBI agent. A few well-placed references can help allay those fears.”

Recruiters, we’re told, frequently advertise on hacker forums. These groups reach many potential recruits and are often password-protected. However, it is pretty easy to trace anyone who logs into one without bothering to anonymize their traffic. Another option is to advertise on the dark web— researchers say they even found a “sort of Monster.com for cybercrime” there.

The post goes on to discuss job requirements, interviews, and probationary periods. We’re reminded that, no matter how many advanced cybersecurity tools get pushed to market, most attack are pretty basic; they involve approaches like denial-of-service and SQL injection. So, MorningLightMountain advises, any job-seeking hackers should be good to go if they just keep up those skills.

Cynthia Murrell, May 7, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under Data, News, Privacy, Search, Security, Technology | Comments Off on How Hackers Hire

A Not-For-Profit Search Engine? That’s So Crazy It Just Might Work

May 4, 2016

The Common Search Project has a simple and straightforward mission statement. They want a nonprofit search engine, an alternative to the companies currently running the Internet (ahem, Google.) They are extremely polite in their venture, but also firmly invested in three qualities for the search engine that they intend to build and run: openness, transparency, and independence. The core values include,

“Radical transparency. Our search results must be explainable and reproducible. All our code is open source and results are generated only using publicly available data. Transparency also extends to our governance, finances and day-to-day operations. Independence. No single person, company or special interest must be able to influence the order of our search results to their benefit. … Public service. We want to build and operate a free service targeted at a large, mainstream audience.”

Common Search currently offers a Demo version for searching homepages only. They are an exciting development compared to the other David’s who have swung at Google’s Goliath. Common Search makes DuckDuckGo, the search engine focused on ensuring user privacy, look downright half-assed. They are calling for, and creating, a real alternative with a completely fresh perspective that isn’t solely about meeting user needs, but insisting on user standards related to privacy, control, and clarity of results.

Chelsea Kerwin, May 4, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under Data, Google, News, Search, Technology | Comments Off on A Not-For-Profit Search Engine? That’s So Crazy It Just Might Work

Google Relies on Freebase Machine ID Numbers to Label Images in Knowledge Graph

May 3, 2016

The article on Seo by the Sea titled Image Search and Trends in Google Search Using FreeBase Entity Numbers explains the transformation occurring at Google around Freebase Machine ID numbers. Image searching is a complicated business when it comes to differentiating labels. Instead of text strings, Google’s Knowledge Graph is based in Freebase entities, which are able to uniquely evaluate images- without language. The article explains with a quote from Chuck Rosenberg,

“An entity is a way to uniquely identify something in a language-independent way. In English when we encounter the word “jaguar”, it is hard to determine if it represents the animal or the car manufacturer. Entities assign a unique ID to each, removing that ambiguity, in this case “/m/0449p” for the former and “/m/012×34” for the latter.”

Metadata is wonderful stuff, isn’t it? The article concludes by crediting Barbara Starr, a co-administrator of the Lotico San Diego Semantic Web Meetup, with noticing that the Machine ID numbers assigned to Freebase entities now appear in Google Trend’s URLs. Google Trends is a public web facility that enables an exploration of the hive mind by showing what people are currently searching. The Wednesday that President Obama nominated a new Supreme Court Justice, for example, had the top search as Merrick Garland.

Chelsea Kerwin, May 3, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under Data, Google, News, Search, Technology, User experience | Comments Off on Google Relies on Freebase Machine ID Numbers to Label Images in Knowledge Graph

« Previous Page — Next Page »

Search the site
Subscribe to Beyond Search
Feature archive
News archive

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.

Categories
- 3D-Printing
- Acquisition
- Advertising
- Aggregation
- AI
- Alexa
- algorithms
- Amazon
- Amazonia
- Analytics
- Appliance
- Applications
- Audio
- Augmented Reality
- Big data
- Bing
- Bitcoin
- Bitext
- Book review
- Business intelligence
- Business process
- Business strategy
- Censorship
- Cloud computing
- Company Profile
- Conferences
- Connectors
- Consulting
- Consumer
- Content processing
- Copyright
- Corporate Concerns
- Cost
- Crawl
- Crowdfunding
- cryptocurrency
- Customer support
- Cyber OSINT
- cybercrime
- cybersecurity
- Dark Web
- DarkCyber
- Data
- Data mining
- Database
- Deepfakes
- Digital Assistant
- Digital Library
- E2EE
- ECommerce
- EDiscovery
- Editorial opinion
- Education
- Emoticons
- Enterprise
- Enterprise search
- Entity extraction
- Ethics
- Facebook
- Faceted search
- Factualities
- Feature
- Federated search
- Financial
- Google
- Governance
- Government
- Hackers
- healthcare
- IBM Watson
- Image search
- Indexing
- Infrastructure
- Innovation
- Integration
- intelware
- Interface
- Internet
- Interview
- Investment
- law enforcement
- Legal matters
- Library automation
- Management
- Marketing
- Mathematics
- Metadata
- Microsoft
- Mobile
- Natural language processing
- News
- NGIA
- Online (general)
- Open Access
- Open source
- OSINT
- Osint Radar
- Overflight
- Palantir
- Patents
- Personnel
- Podcast
- Policeware
- Portals
- Predictive coding
- Privacy
- Profile
- Publishing
- Quotation
- Real time search
- Reference tool
- Rich media
- Robot Writer
- Search
- Search enabled applications
- search engine
- Search quality
- Security
- Semantic
- Sentiment analysis
- SEO
- SharePoint
- Short Honks
- Smart Technology
- Social
- Social Media
- software
- Statistics
- Taxonomy
- Technology
- Text analytics
- Text processing
- Tools
- Tor
- Training
- Translation
- Twitter
- Uncategorized
- Unstructured Data
- User experience
- User Interface
- Vertical search
- Video
- visualization
- Voice search
- Voice technology
- Web 3
- Web Services
- Webinar
- Windows
- Work flow
- XML
- Yahoo

Beyond Search

Amusing Mistake Illustrates Machine Translation Limits

Billions in vc Funding Continues Rinse and Repeat Process

DARPA Seeks Keys to Peace with High-Tech Social Science Research

Update from Lucene

New Criminal Landscape Calls for New Approaches

Wikipedia Relies on Crowdsourcing Once More

European Cybersecurity Companies

How Hackers Hire

A Not-For-Profit Search Engine? That’s So Crazy It Just Might Work

Google Relies on Freebase Machine ID Numbers to Label Images in Knowledge Graph

Search the site

Categories

Archives

Recent Posts

Meta

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Search the site

Categories

Archives

Recent Posts

Meta