Set Data Free from PDF Tables
April 13, 2015
The PDF file is a wonderful thing. It takes up less space than alternatives, and everyone with a computer should be able to open one. However, it is not so easy to pull data from a table within a PDF document. Now, Computerworld informs us about a “Free Tool to Extract Data from PDFs: Tabula.” Created by journalists with assistance from organizations like Knight-Mozilla OpenNews, the New York Times and La Nación DATA, Tabula plucks data from tables within these files. Reporter Sharon Machlis writes:
“To use, download the software from the project website. It runs locally in your browser and requires a Java Runtime Environment compatible with Java 6 or 7. Import a PDF and then select the area of a table you want to turn into usable data. You’ll have the option of downloading as a comma- or tab-separated file as well as copying it to your clipboard.
“You’ll also be able to look at the data it captures before you save it, which I’d highly recommend. It can be easy to miss a column and especially a row when making a selection.”
See the write-up for a video of Tabula at work on a Windows system. A couple of caveats: the tool will not work with scanned images. Also, the creators caution that, as of yet, Tabula works best with simple table formats. Any developers who wish to get in on the project should navigate to its GitHub page.
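Machlis’s advice to inspect the captured data before saving can also be backed up with a quick check after export. What follows is only a minimal sketch, not part of Tabula: it assumes the table was downloaded as a comma-separated file, and the function name find_ragged_rows and the sample data are made up for illustration. It flags any row whose column count differs from the most common width, which is one symptom of a selection that clipped a column.

```python
import csv
import io
from collections import Counter


def find_ragged_rows(csv_text, delimiter=","):
    """Return (expected_width, problems) for a block of CSV text.

    problems lists (line_number, width) for every row whose column count
    differs from the most common column count in the file.
    """
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return 0, []
    widths = Counter(len(row) for row in rows)
    expected = widths.most_common(1)[0][0]
    problems = [
        (line_no, len(row))
        for line_no, row in enumerate(rows, start=1)
        if len(row) != expected
    ]
    return expected, problems


# Made-up sample standing in for an exported table; line 3 lost a column.
sample = (
    "city,year,population\n"
    "Springfield,2014,30000\n"
    "Shelbyville,2014\n"
    "Ogdenville,2014,12000\n"
)
expected, problems = find_ragged_rows(sample)
print(expected, problems)
```

A check like this catches a clipped column but not a missing row, so comparing the row count against the original PDF is still worthwhile.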
Cynthia Murrell, April 13, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
A Former Googler Reflects
April 10, 2015
After a year away from Google, blogger and former Googler Tim Bray (now at Amazon) reflects on what he does and does not miss about the company in his post, “Google + 1yr.” Anyone who follows his blog, ongoing, knows Bray has been outspoken about some of his problems with his former employer: First, he really dislikes “highly-overprivileged” Silicon Valley and its surrounds, where Google is based. Secondly, he found it unsettling to never communicate with the “actual customers paying the bills,” the advertisers.
What does Bray miss about Google? Their advanced bug tracking system tops the list, followed closely by the slick and efficient, highly collaborative internal apps deployment. He was also pretty keen on being paid partially in Google stock between 2010 and 2014. The food on campus is everything it’s cracked up to be, he admits, but as a remote worker, he rarely got to sample it.
It was a passage in Bray’s “neutral” section that most caught my eye, though. He writes:
“The number one popular gripe against Google is that they’re watching everything we do online and using it to monetize us. That one doesn’t bother me in the slightest. The services are free so someone’s gotta pay the rent, and that’s the advertisers.
“Are you worried about Google (or Facebook or Twitter or your telephone company or Microsoft or Amazon) misusing the data they collect? That’s perfectly reasonable. And it’s also a policy problem, nothing to do with technology; the solutions lie in the domains of politics and law.
“I’m actually pretty optimistic that existing legislation and common law might suffice to whack anyone who really went off the rails in this domain.
“Also, I have trouble getting exercised about it when we’re facing a wave of horrible, toxic, pervasive privacy attacks from abusive governments and actual criminals.”
Everything is relative, I suppose. Still, I think it understandable for non-insiders to remain leery about these companies’ data habits. After all, the distinction between “abusive government” and businesses is not always so clear these days.
Cynthia Murrell, April 10, 2015
Predicting Plot Holes Isn’t So Easy
April 10, 2015
According to The Paris Review’s blog post “Man In Hole II: Man In Deeper Hole,” Matthew Jockers created an analysis tool to predict archetypal book plots:
“A rough primer: Jockers uses a tool called “sentiment analysis” to gauge “the relationship between sentiment and plot shape in fiction”; algorithms assign every word in a novel a positive or negative emotional value, and in compiling these values he’s able to graph the shifts in a story’s narrative. A lot of negative words mean something bad is happening, a lot of positive words mean something good is happening. Ultimately, he derived six archetypal plot shapes.”
Academics, however, found some problems with Jockers’s tool. Can every word really be assigned an emotional value, and can all plots really take basic forms? The problem is that words are as nuanced as human emotion, perspectives change in an instant, and sentiments are subjective. How would the tool rate sarcasm?
Stories in general have long been sorted into seven basic plots, so why should it not be possible to do the same for book plots? Jockers has already identified six basic book plots, and there are some who are curiously optimistic about his analysis tool. It does raise the question of whether the tool will stanch authors’ creativity or make English professors derive even more subjective meaning from Ulysses.
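The word-scoring idea in the quoted primer can be illustrated with a toy example. This is a minimal sketch under loud assumptions, not Jockers’s actual tool: the six-word LEXICON, the function names, and the sample story are all invented here, and real sentiment lexicons score thousands of words. Each word gets +1, -1, or 0, and the scores are summed over consecutive chunks of text to trace a rough plot shape.

```python
import re

# Hypothetical toy lexicon, invented for this sketch.
LEXICON = {"happy": 1, "love": 1, "joy": 1, "sad": -1, "lost": -1, "death": -1}


def word_scores(text, lexicon=LEXICON):
    """Score every word in the text; words missing from the lexicon get 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return [lexicon.get(word, 0) for word in words]


def plot_shape(scores, window):
    """Sum the scores over consecutive, non-overlapping windows of words."""
    return [sum(scores[i:i + window]) for i in range(0, len(scores), window)]


story = (
    "Joy and love filled the house. "
    "Then death came, all was lost. "
    "Later they were happy again."
)
shape = plot_shape(word_scores(story), window=6)
print(shape)  # [2, -2, 1]: up, down, then partly back up
```

The sketch also shows why the academics’ objection bites: a sarcastic “What a joy” scores +1 because the lookup sees only the word, never the tone.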
Whitney Grace, April 10, 2015
The Cost of a Click Through Bing Ads
April 9, 2015
Wow. As an outsider to the world of marketing, I find these figures rather astounding. MarketingProfs shares an infographic titled, “The 20 Most Expensive Bing Ads Keywords.” The data comes from a recent analysis by WordStream of 10 million English keywords, grouped into categories. Writer Vahe Habeshian tells us:
“WordStream analyzed some 10 million English keywords and grouped them into categories to determine the most expensive types of keywords (see infographic, below).
“(Also see a similar analysis of the most expensive keywords in Google AdWords advertising from 2011.)
“The most expensive keyword on Bing Ads is ‘lawyer,’ which would cost advertisers seeking the top ad spot a whopping $109.21 per click. Not surprisingly, the top 5 keywords are related to the legal world, indicating how lucrative clients can be.”
Yes, almost $110 per click whether legitimate, a human error, or a robot script. That’s a lot of fruitless clicks. It seems irrational, but it must be working if companies keep spending the dough. Right?
The word in second place, “attorney,” comes to $101.77 per click, and “DUI” is a comparative bargain at $68.56. After the top five law-related words, there are such valuable terms as “annuity,” “rehab,” and “exterminator.” See the infographic for more examples.
Cynthia Murrell, April 09, 2015
Microsoft Streamlining Update Process for SharePoint 2016
April 9, 2015
One of the most frequent complaints from SharePoint users and administrators is the cumbersome update process. It seems that Microsoft is listening and finally responding. Read more in the Redmond Channel Partner article, “Microsoft To Revamp Update Process for SharePoint 2016.”
The article sums up the news:
“The process of updating SharePoint Server will become less cumbersome in the next version of the product, according to a Microsoft executive. Speaking about the upcoming SharePoint 2016 during an IT Unity-hosted talk last Friday, Bill Baer, a Microsoft senior technical product manager and a Microsoft Certified Master for SharePoint, said that IT pros will get smaller updates and that applying them will entail less downtime for organizations.”
Less downtime for organizations will be a welcome change. Stephen E. Arnold is a longtime search expert, and has followed SharePoint through its ups and downs. He often finds that though SharePoint is the most widely adopted enterprise solution, its complicated nature and poor user experience often lead to perceived failures. Keep up with the latest SharePoint news on ArnoldIT.com, specifically the dedicated SharePoint feed, to determine if the streamlining of updates leads to higher marks for SharePoint.
Emily Rae Aldridge, April 9, 2015
Progress in Image Search Tech
April 8, 2015
Anyone interested in the mechanics behind image search should check out the description of PicSeer: Search Into Images from YangSky. The product write-up goes into surprising detail about what sets their “cognitive & semantic image search engine” apart, complete with comparative illustrations. The page’s translation seems to have been done either quickly or by machine, but don’t let the awkward wording in places put you off; there’s good information here. The text describes the competition’s approach:
“Today, the image searching experiences of all major commercial image search engines are embarrassing. This is because these image search engines are
- Using non-image correlations such as the image file names and the texts in the vicinity of the images to guess what are the images all about;
- Using low-level features, such as colors, textures and primary shapes, of image to make content-based indexing/retrievals.”
With the first approach, they note, trying to narrow the search terms is inefficient because the software is looking at metadata instead of inspecting the actual image; any narrowed search excludes many relevant entries. The second approach above simply does not consider enough information about images to return the most relevant, and only most relevant, results. The write-up goes on to explain what makes their product different, using for their example an endearing image of a smiling young boy:
“How can PicSeer have this kind of understanding towards images? The Physical Linguistic Vision Technologies have can represent cognitive features into nouns and verbs called computational nouns and computational verbs, respectively. In this case, the image of the boy is represented as a computational noun ‘boy’ and the facial expression of the boy is represented by a computational verb ‘smile’. All these steps are done by the computer itself automatically.”
See the write-up for many more details, including examples of how Google handles the “boy smiles” query. (Be warned: there’s a very brief section about porn filtering that includes a couple of censored screenshots and adult keyword examples.) It looks like image search technology is progressing apace.
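The second competing approach in the quoted list, indexing on low-level features such as color, can be sketched in a few lines. This is an illustration only, not PicSeer’s or any search engine’s actual method: the function names and the four-pixel “images” are invented, and color histograms with histogram intersection stand in for whatever features a real engine uses.

```python
from collections import Counter


def color_histogram(pixels, levels=2):
    """Normalized histogram over coarse RGB bins.

    pixels is a list of (r, g, b) tuples with values 0-255. Each channel is
    quantized into `levels` buckets; this sketch assumes levels divides 256.
    """
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: n / total for bucket, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the sum of per-bin minimums of two histograms."""
    return sum(min(share, h2.get(bucket, 0.0)) for bucket, share in h1.items())


RED, BLUE = (255, 0, 0), (0, 0, 255)
# Made-up four-pixel "images": a and b share colors but not arrangement.
image_a = [RED, RED, BLUE, BLUE]
image_b = [BLUE, RED, BLUE, RED]
image_c = [BLUE, BLUE, BLUE, BLUE]

hist_a, hist_b, hist_c = (color_histogram(p) for p in (image_a, image_b, image_c))
print(histogram_intersection(hist_a, hist_b))  # 1.0
print(histogram_intersection(hist_a, hist_c))  # 0.5
```

Images a and b come out as a perfect match even though their pixels are arranged differently, which is the weakness the write-up describes: color statistics alone say nothing about what an image depicts.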
Cynthia Murrell, April 08, 2015
Google Has Made Web Sites Hot and Angry
April 7, 2015
Business Insider tells more about Google’s dominating behavior in “The Google Backlash Is Growing.” The backlash spawned from the FTC’s recently leaked report about how Google threatened to remove Web sites from search engine results if they did not allow Google to use their content.
“At the heart of the matter is the internal FTC report’s finding that Google was effectively blackmailing competing sites like Yelp and Amazon into using their data in its own search result. If they didn’t agree, they would get blacklisted from search results entirely.”
Google was facing a lawsuit, but it made some changes and was able to escape…in the US. In Europe, an investigation is still underway. Some think the EU is harboring hostilities against a US company, but the EU says it is not.
People in the US like Consumer Watchdog want the US Senate to reopen investigations to prove that Google is favoring its own services in search results and making competition appear in lower search rankings. Google, however, maintains its innocence and wants the matter to rest.
Is it not common business practice to downplay the competition? Not to say Google is innocent, but it makes logical sense to use that old school business tactic, especially when they control a whole lot of search.
Whitney Grace, April 7, 2015
Mistakes to Avoid When Migrating to Office 365
April 2, 2015
Sadly, many migrations are considered failures by the organization and users, even if all the content survives. Why is this the case? Well, user experience usually suffers greatly. Redmond Magazine offers more insight and advice in their article, “5 Mistakes To Avoid When Migrating from SharePoint to Office 365.”
The article starts with a mention of the upcoming SharePoint 2016 release and the ever-evolving Office 365 before stating:
“The question for many organizations isn’t whether to stay with SharePoint — rather, IT managers are grappling with how to advance its use in the most strategic and cost-effective way possible. As organizations consider a myriad of options from Microsoft, it becomes essential to have not only a long-term strategic technology vision — but also a SharePoint migration and upgrade roadmap that’s big on efficiency and low on cost.”
It is easy to be shortsighted. And while planning is hard and cumbersome, having a long-term plan is one of the only ways to avoid some of the mistakes mentioned in the article. Stephen E. Arnold is another resource to consider when planning. His Web site, ArnoldIT.com, is a top destination for the latest news in search, including SharePoint. His SharePoint feed provides a one-stop-shop for all the latest tips and tricks to assist your organization with their SharePoint planning.
Emily Rae Aldridge, April 2, 2015
Rakuten Goes Into OverDrive
April 1, 2015
If you use a public library or attend school, you might be familiar with the OverDrive system. It allows users to download and read ebooks on a tablet of their choice for a limited time, similar to the classic library borrowing policy. The Reuters article “Update 2: Rakuten Buying eBook Firm OverDrive For $410 Million In US Push” explains how the Japanese online retailer Rakuten Inc. bought the company.
Rakuten has been buying many businesses in the “sharing economy,” including raising $530 million for Lyft. OverDrive is a sharing company, because it shares books with people. That is not the only reason why Rakuten bought the company:
“Another reason for the purchase is the firm’s reach in the U.S. market, [Takahito Aiki, head of Rakuten’s global eBook business] said. Rakuten has been on a buying spree in recent years to reduce reliance on its home market in Japan. In October it bought U.S. discount store Ebates.com for about $1 billion.”
What does this mean for the textbook industry, though? Will it hurt or help it? When Amazon and other online textbook services launched with cheaper alternatives, the brick and mortar businesses felt the crunch. The cup may be either half full or half empty. Publishers may not be familiar with the sharing economy and may have an opportunity to learn first hand if this deal goes down.
Whitney Grace, April 1, 2015
Is Google Net Neutral?
March 31, 2015
When the FCC adopted rules that protect net neutrality, the Internet rejoiced that its crazy antics would be safeguarded and that content would not be as regulated when it comes to search retrieval and indexing. Big technology companies that make the bulk of their revenue from Internet-related services and products are beginning to voice their opinions on the matter, including Google. Drew Crawford wrote a very heated post on his blog, Sealed Abstract, about Google’s stance in the net neutrality argument: “Google, Our Patron Saint Of The Closed Web.” The post points out that Google is net neutral with the Droid open market and its employees’ blogs, but apparently Google is also out to destroy the free Web.
Google plans to take control of all .dev domain addresses, and possibly others, in an effort to have these extensions solely related to Google products and services. In short, if you want to use a domain with this ending, say for a blog, you will be forced to use a Google service. It is reminiscent of when Google forced people to sign up for Google Plus if they wanted to continue using YouTube.
“My point is that if you think Google is some kind of Patron Saint of the Open Web, shit son. Tim Cook on his best day could not conceive of a dastardly plan like this. This is a methodical, coordinated, long-running and well-planned attack on the open web that comes from the highest levels of Google leadership.”
The news is not surprising when you assemble the pieces, but it is disheartening that there do not seem to be any big companies on the little guy’s side. And I thought Google was committed to not being evil.
Whitney Grace, March 31, 2015
Get your copy of CyberOSINT: Next Generation Information Access at http://www.xenky.com/cyberosint