Natural Language Processing App Gains Increased Vector Precision

March 1, 2016

For us, concepts have meaning in relationship to other concepts, but it is easy for computers to define concepts in terms of usage statistics. The post “Sense2vec with spaCy and Gensim” on the spaCy blog offers a well-written outline of how natural language processing works, highlighting the team’s new sense2vec app. The application upgrades word2vec with more context-sensitive word vectors. The article describes more precisely how sense2vec works:

“The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straight-forward solution is to just have two entries, duckN and duckV. We’ve wanted to try this for some time. So when Trask et al (2015) published a nice set of experiments showing that the idea worked well, we were easy to convince.

We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”
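
For readers who want to experiment, here is a minimal sketch of that recipe, assuming spaCy (with its small English model) and Gensim are installed. The pipe-separated token format, the toy corpus, and the entity-only merging are my illustrative choices, not the spaCy team’s actual pipeline; note that recent Gensim releases use vector_size where older ones used size.

    import spacy
    from gensim.models import Word2Vec

    nlp = spacy.load("en_core_web_sm")

    def tagged_sentences(texts):
        # Tag each token with its entity label (if any) or part of speech,
        # merging named entities first so each one receives a single vector.
        for doc in nlp.pipe(texts):
            with doc.retokenize() as retok:
                for ent in doc.ents:
                    retok.merge(ent)
            yield [
                tok.text.lower().replace(" ", "_") + "|" + (tok.ent_type_ or tok.pos_)
                for tok in doc if not tok.is_space
            ]

    corpus = ["I saw a duck on the pond.", "Duck before the branch hits you."]
    model = Word2Vec(list(tagged_sentences(corpus)), vector_size=100, min_count=1)
    print(model.wv.most_similar("duck|NOUN"))

On a two-sentence corpus the nearest neighbors are noise, of course; the point is that duck|NOUN and duck|VERB become separate entries, which is exactly the duckN versus duckV trick the quote describes.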

Curious about the meta definition of natural language processing from spaCy, we queried “natural language processing” using sense2vec. Its neural network was trained on every word posted on Reddit in 2015. While it is a feat for NLP to learn from a dataset drawn from one platform, such as Reddit, what about processing that scours multiple data sources?

 

Megan Feil, March 1, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

LinkedIn: Looking for Its Next Gig?

February 29, 2016

I signed up for the free LinkedIn years ago. I don’t do too much LinkedIn surfing. I do delete the email I get from the company. I had one of the goslings post a list of my articles to see what would happen. (Results of the test: Nothing happened.) I find it amusing that marketers and PR “professionals” want to be my LinkedIn contact. I used to write these folks and ask, “Why do you want to be my LinkedIn friend?” (Results of the test: No one writes back.) Now you know why I don’t do much LinkedIn surfing. No, I don’t read the musings of the firm’s “thought leaders.”

I did read “LinkedIn Problems Run Deeper Than Valuation.” The write up informed me of this interesting “assertion”:

The problem stems from each of the company’s revenue streams, which ultimately diminish the business value of using the service. Whether it’s being paid to promote content, focusing on sales and recruitment over other professions, or interruptive advertising, these streams incentivize poor behavior by individual users on the site.

I like that “poor behavior” and the incentive angle. The concrete foundation of LinkedIn, it seems to me, is spam.

The company, according to the write up, has a reason to face each day with a big smile:

The company still has assets that are the envy of any tech company — a vast user base and a wealth of content to exploit.

As Yahoo’s publishing experiment demonstrates, content may not be enough.

I think the larger issue is the fact that social networks often lose their stickiness after a period of time. Google’s social efforts seem to mirror the challenges of MySpace. LinkedIn may find itself trapped by its own job hunting system choked with marketers’ leading thoughts.

Why not drive for Uber, Lyft, or Amazon? Less spam and probably a shorter path to some real cash. By the way, did you ever try to locate something using the company’s search engine? Quite a piece of work is that.

Stephen E Arnold, February 29, 2016

New Tor Communication Software for Journalists and Sources Launches

February 29, 2016

A new one-to-one messaging tool for journalists has launched after two years in development. The article “Ricochet Uses Power of the Dark Web to Help Journalists, Sources Dodge Metadata Laws” from The Age describes this new darknet-based software. What sets Ricochet apart from other tools journalists use, such as Wickr, is that it does not route messages through a central server; it relies on Tor instead. Advocates acknowledge the risk of Dark Web software being used for criminal activity, but they assert the aim is to give sources and whistleblowers an anonymous channel for releasing information securely to journalists without exposure. The article explains,

“Dr Dreyfus said that the benefits of making the software available would outweigh any risks that it could be used for malicious purposes such as cloaking criminal and terrorist operations. “You have to accept that there are tools, which on balance are a much greater good to society even though there’s a tiny possibility they could be used for something less good,” she said. Mr Gray argued that Ricochet was designed for one-to-one communications that would be less appealing to criminal and terrorist organisers that need many-to-many communications to carry out attacks and operations. Regardless, he said, the criminals and terrorists had so many encryption and anonymising technologies available to them that pointing fingers at any one of them was futile.”
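
For the technically curious, the serverless design boils down to this: each party runs a Tor hidden service, and the other party connects to it through the local Tor SOCKS proxy, so no third-party server ever handles the messages. Below is a minimal sketch of that plumbing using the PySocks library; the .onion address and port are placeholders, Tor is assumed to be listening on 127.0.0.1:9050, and none of this is Ricochet’s actual wire protocol.

    import socks  # PySocks; assumes a local Tor client with its SOCKS5 port at 9050

    sock = socks.socksocket()
    # rdns=True hands hostname resolution to Tor itself, which is what
    # makes .onion addresses reachable at all.
    sock.set_proxy(socks.SOCKS5, "127.0.0.1", 9050, rdns=True)
    sock.connect(("exampleonionaddressplaceholder.onion", 80))  # placeholder peer
    sock.sendall(b"hello over a hidden service\n")
    print(sock.recv(4096))
    sock.close()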

Demand for online anonymity is clearly growing, as evidenced by the recent launch of Tor-based tools like Ricochet, alongside encrypted messengers like Wickr and consumer-oriented apps like Snapchat. The Dark Web’s user base appears to be growing and diversifying. Will public perception follow suit?

 

Megan Feil, February 29, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Is Bing Full of Bugs or Is Constant Change and “Agility” the Wave of the Future?

February 29, 2016

The article titled “600 Engineers Make 4,000 Changes to Bing Each Week” on WinBeta goes behind the scenes of a search engine. The title seems to suggest that Bing is a disaster with so many bugs that only a fleet of engineers working around the clock can keep the system under control. That is far from the impression the article actually leaves. Instead, it stresses the constant innovation that Bing calls “Continuous Delivery” or “Agility.” The article states,

“How about the 600 engineers mentioned above pushing more than 4,000 individual changes a week into a testing phase containing over 20,000 tests. Each test can last from 10 minutes to several hours or days… Agility incorporates two “loops,” the Inner Loop that is where engineers write the code, prototype, and crowd-source features. Then, there’s an Outer Loop where the code goes live, gets tested by users, and then pushes out to the world.”

For more details on the sort of rapid and creative efforts made possible by so many engineers, check out the Bing Visual Blog Post created by a Microsoft team. The article also reminds us that Bing is not only a search engine but also the life-force behind Microsoft’s Cortana, as well as being integrated into Microsoft Office 2016, AOL, and Siri.

 

Chelsea Kerwin, February 29, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Alphabet Google Search: Dominance Ending?

February 28, 2016

I read “Will PageRank Expiration Threaten Google’s Dominance.” The main point is a question: will Google’s shift to artificial intelligence “hurt Google Search’s market share and its stock price?”

The write up references the 1997 paper describing the search engine’s core algorithms. (There is no reference to the work by Jon Kleinberg and the Clever system, which is understandable, I suppose.) Few want to view Google as a me-too outfit, “cleverly” overlooking the firm’s emulation strategy. Think GoTo.com/Overture/Yahoo for the monetization mechanism.
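
For anyone who skipped the 1997 paper, the heart of PageRank is a short power-iteration computation over the link graph. This toy sketch is mine, not Google’s production code, and it glosses over the refinements (dangling pages, personalization) that real implementations need.

    import numpy as np

    def pagerank(A, damping=0.85, iters=50):
        # A[i, j] = 1 when page j links to page i. Dividing each column by
        # its out-degree makes every page split its "vote" among its links.
        n = A.shape[0]
        out = A.sum(axis=0)
        out[out == 0] = 1  # crude guard against dividing by zero for dangling pages
        M = A / out
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - damping) / n + damping * M @ r
        return r

    # Toy web: page 0 links to pages 1 and 2; pages 1 and 2 link back to 0.
    A = np.array([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 0., 0.]])
    print(pagerank(A))  # page 0 accumulates the most rank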

The write up states:

The Google Search today transcends PageRank: Google has a myriad of proprietary technology.

I agree. Google is not an open source centric outfit. When was the last time Google made it easy to locate its employees’ technical papers, presentations at technical conferences, or details about products and services which just disappear? Orkut, anyone?

The write up shifts its focus to some governance issues; for example, Google’s Loon balloon, solving death, etc. There is a reference to Google’s strategy concerning mobile phones.

Stakeholders may want to worry because Google is dependent on search for the bulk of its revenues. I learned:

From Alphabet’s recent 10-k and Google’s Search revenues from Statista, you will realize that Search has been ~92%, ~90%, ~90% of total revenues in 2013-2015 respectively.

No big news here.

The core challenge for analysts will be to figure out whether a shift to artificial intelligence methods for search will have unforeseen consequences. For example, maybe Google has figured out that the cost of indexing the Web is too high, and AI may be a way to reduce the costs of indexing and serving results. Google may also realize that the shift from desktop queries to mobile queries threatens its ability to deliver information with the relevance users came to expect from the desktop experience.

Alphabet Google is at a crossroads. The desktop model from the late 1990s is less and less relevant in 2016. Like any other company faced with change, Google’s executives find themselves in the same boat as other online vendors. Today’s revenues may not be the revenues of tomorrow.

Will Alphabet Google face the information headwinds which buffeted Autonomy, Convera, Endeca, Fast Search & Transfer, and similar information access vendors? Is Google facing a long walk down the path which America Online and Yahoo followed? Will the one trick revenue pony die when it cannot adapt to the mobile jungle?

Good questions. Answers? Tough to find.

Stephen E Arnold, February 28, 2016

Around Paywalls? Probably Not Spot On

February 27, 2016

I read “How Google’s Web Crawler Bypasses Paywalls.” I am not confident the write up is spot on. You may find the information useful in your own efforts to do the Connotate-type or Kimono-type thing.

The outfit with the paywall tunnel, according to the write up, is Alphabet’s Google unit. Talk about the tail wagging the dog.

The write up points out that the method uses Referer and User-Agent headers.

The approach is detailed in the article via code snippets. It’s in the cards, so have at it.
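
Without reproducing the article’s snippets, the gist is simple enough to sketch: many paywalls wave through visitors who look like Googlebot or who arrive from a Google results page, and both signals live in plain request headers. The URL below is a placeholder, and many sites now verify Googlebot by reverse DNS lookup, so treat this as a test of header-based checks only.

    import requests

    # Send the two headers the write up mentions: a Googlebot-style
    # User-Agent and a Google Referer. Which one a site trusts, if either,
    # varies from paywall to paywall.
    headers = {
        "User-Agent": ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                       "+http://www.google.com/bot.html)"),
        "Referer": "https://www.google.com/",
    }
    resp = requests.get("https://example.com/paywalled-story", headers=headers)
    print(resp.status_code, len(resp.text))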

Oh, there may be other methods in play, but I will leave you to your experimentation.

Stephen E Arnold, February 27, 2016

DuckDuckGo: Challenging Google Is Not a Bad Idea

February 25, 2016

I read “The Founder of DuckDuckGo Explains Why Challenging Google Isn’t Insane.” I noted several statements in the write up; namely:

  • DuckDuckGo delivers three billion searches a year, compared to Google’s trillion-plus searches per year. The zeros can be confusing to an addled goose like me. Let me say that Google is delivering more searches than DuckDuckGo.com.
  • DuckDuckGo’s revenues in 2015 were more than $1 million. Google’s revenues were about $75 billion. Yep, more zeros.
  • It used to take Google six months to index pages on the Internet. (I thought that Google indexed from its early days based on a priority algorithm. Some sites were indexed in a snappy manner; others, like the National Railway Retirement Board, less snappily. I am probably dead wrong here, but it is a nifty point to underscore Google’s slow indexing. I just don’t think it was or is true.)
  • DuckDuckGo was launched in 2008. The company is almost eight years old.
  • Google’s incognito mode is a myth. What about those Google cookies? (I think the incognito mode nukes those long lived goodies.)

Here’s the passage I highlighted:

Adams (the interviewer): I thought the government could track me whether I use DuckDuckGo or not.

Weinberg (the founder of DuckDuckGo): No they can’t. They can get to your Google searches, but if you use DuckDuckGo it’s completely encrypted between you and us. We don’t store anything. So there’s no data to get. The government can’t subpoena us for records because we don’t have records.

DuckDuckGo beats the privacy drum. That’s okay, but the privacy of Tor and I2P can be called into question. Is it possible that there are systems and methods to track user queries with or without the assistance of the search engine system? My hunch is that there are some interesting avenues to explore from companies providing tools to various government agencies. What about RACs, malware, metadata analyses, etc.? Probably I am wrong again. RATs. I have no immunity from my flawed information. I may have to grab my swim fins and go fin-fishing. I could also join a hacking team and vupen it up.

Stephen E Arnold, February 25, 2016

Brown Dog Fetches Buried Data

February 25, 2016

Outdated file formats, particularly those with no metadata, are especially difficult to search and utilize. The National Science Foundation (NSF) reports on a new search engine designed to plumb the unstructured Web in “Brown Dog: A Search Engine for the Other 99 Percent (of Data).” With the help of a $10 million award from the NSF, a team at the National Center for Supercomputing Applications (NCSA), based at the University of Illinois, has developed two complementary services. Writer Aaron Dubrow explains:

“The first service, the Data Access Proxy (DAP), transforms unreadable files into readable ones by linking together a series of computing and translational operations behind the scenes. Similar to an Internet gateway, the configuration of the Data Access Proxy would be entered into a user’s machine settings and then forgotten. From then on, data requests over HTTP would first be examined by the proxy to determine if the native file format is readable on the client device. If not, the DAP would be called in the background to convert the file into the best possible format….

“The second tool, the Data Tilling Service (DTS), lets individuals search collections of data, possibly using an existing file to discover other similar files in the data. Once the machine and browser settings are configured, a search field will be appended to the browser where example files can be dropped in by the user. Doing so triggers the DTS to search the contents of all the files on a given site that are similar to the one provided by the user….  If the DTS encounters a file format it is unable to parse, it will use the Data Access Proxy to make the file accessible.”
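
To make the DAP’s role concrete, here is a hypothetical sketch of the decision it makes on each request. The function names, the format list, and the conversion stub are invented for illustration; they are not Brown Dog’s actual API.

    import mimetypes

    # Formats the client device is assumed to render natively (illustrative list).
    READABLE = {"application/pdf", "text/plain", "image/png"}

    def data_access_proxy(path, readable=READABLE):
        # Pass readable files straight through; otherwise hand the file to
        # the (stubbed) background conversion chain.
        mime, _ = mimetypes.guess_type(path)
        if mime in readable:
            return path
        return convert(path, target="application/pdf")

    def convert(path, target):
        # Stand-in for the chained computing and translational operations
        # the quote describes.
        return path + ".converted"

    print(data_access_proxy("field_notes.wpd"))  # a legacy format gets converted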

See the article for more on these services, which NCSA’s Kenton McHenry likens to a DNS for data. Brown Dog conforms to NSF’s Data Infrastructure Building Blocks program, which supports development work that advances the field of data science.

 

Cynthia Murrell, February 25, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Legal Eagles and Technology: The Uber Search Dictum

February 24, 2016

I love legal eagles. When there is a new gadget, the legal eagles are among the first to squawk “class action suit” at the first glitch. Legal eagles also produce entertaining television commercials to complement the AdWords for various high dollar health related problems. Great stuff.

I know that some fledgling legal eagles are not happy with their law schools. Some of these folks whose parents are not partners in a big money law firm or related to certain public office holders are driving Uber cars.

I read “Judge Tells Uber to Do the Impossible: Control Its Google Results.” The article is very entertaining. The main point is that a federal judge ordered Uber to ensure that certain information appeared in a free Web search results list.

But wait. The write up contained a quote which is a keeper:

To slightly tweak a metaphor offered by this Court during the hearing, a preliminary injunction should not serve as a bazooka in the hands of a squirrel, used to extract from a more fearsome animal a bounty which the squirrel would never be able to gather by his own labors — at least not when the larger animal is mostly without sin.

A squirrel with a bazooka. I would have substituted a drone operator in a tree with a laptop and a Predator under control.


Squirrel in Kentucky watching for legal eagles.

There you go, Google. Help out Uber. Adjust the results list to display exactly what the judge orders; for example:

A result containing [Uber’s] 352-area-code number

Words clearly indicating that the result is associated with [Uber].

What happens if Uber cannot figure out how to conform to the “command” using Adwords, white hat SEO, black hat SEO, or something more innovative?

What happens if Google helps out Uber?

I love this stuff. Come to think of it. Squirrels may be more technically savvy than some legal eagles. I think I hear from the tree, “Gray squirrel, gray squirrel, call in a strike on my command.”

“On your command,” replies the gray squirrel.

Stephen E Arnold, February 24, 2016

Fun with Google Search Delivers Fun for Google

February 24, 2016

The article on ValueWalk titled “Top 10 Ways to Have Fun With Google Search” invites readers to enjoy a few of the “Easter eggs” that those nutball programmers over at Google have planted in the search engine. Some are handy, like the spinning coin that gives you a heads or tails result when you type “flip a coin” into Google. Others are just funny, like the way the page tilts if you enter the word “askew.” Others are pure in their nerd factor, as the article explains,

“When you type “Zerg rush” into the search box and hit enter you get a wave of little Google “o”s swarming across and eating the text on your page. Of note, Zerg rush was a tactic used by Zerg players in the late 90s video game StarCraft, which meant sending many waves of inexpensive units to overwhelm an opponent. Typing “Atari Breakout”…leads to a nostalgic flashback for most people older than 45…”

Speaking of nostalgia, if you type in “Google in 1998” the page reverts to the old layout of the search engine’s early days. In general, the “Easter Eggs” are kind of like watching your uncle’s magic tricks. You aren’t really all that impressed, but every now and then a little surprise makes you smile. And you are definitely going to make him do them again in front of your parents later.

 

Chelsea Kerwin, February 24, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 
