Proof Behind Common Crawl Claims

September 18, 2013

Common Crawl is a non-profit foundation with the mission to build and maintain an open crawl of the Web that can be accessed and analyzed by everyone, in support of an open Web for education, research, and business. Does that sound like too lofty a goal? According to a post on Common Crawl’s main Web site, “A Look Inside Our 210TB 2012 Web Corpus,” Sebastian Spiegler, a volunteer for the foundation, investigated the crawl’s effectiveness.

Spiegler wanted to see how the crawl measured up, so he conducted an exploratory analysis of its 2012 data and wrote a summary paper to share his findings. He concluded:

“The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account to more than 55% of the corpus. The corpus contains a large amount of sites from youtube.com, blog publishing services like blogspot.com and wordpress.com as well as online shopping sites such as amazon.com. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainders are images, XML or code like JavaScript and cascading style sheets.”

Spiegler found that Common Crawl is a cost-effective way to access Web-scale data and that the corpus delivers solid results. Inexpensive, feasible solutions are desirable, so Common Crawl just needs to ramp up the advertising.

Whitney Grace, September 18, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

First, Yahoo, Now Microsoft Bing: The Logo Card

September 17, 2013

I just read “Bing Gets a New Logo and Modern Design to Take on Google.” What I find fascinating is that redesign seems to be the go-to method for making it clear that a company is really serious about revenue and value for stakeholders.

The article states:

A year in the making, Bing is dropping its curly blue logo for a modern design that closely matches the rest of Microsoft’s recently redesigned product branding.

I then learned that the color is exactly the same as the color used in “Microsoft’s corporate flag logo.”

Almost as important as color is the change in mobile search. I learned:

One of the big new changes is “Page Zero,” a method to quickly provide an answer or information before a full results page. Page Zero pops up as you type in the search bar on Bing, and if you’re searching for two similarly named people then it allows you to identify the correct subject of your search before the results are listed. For certain queries you might even get news, images, or video links, and common actions like check-in will be displayed on airline queries.

In my September/October column for Information Today (one of the for-fee write ups I still do), I point out that searching for news is getting more difficult, not easier. The flashy interfaces make it difficult to:

  • Determine the date, time, and bibliographic details of some “hits”
  • Spot differences in similar stories because modern design favors cards, tiles, and sizzle over a meaty results list
  • Figure out why a particular “hit” appears. Results pop up which may be ads, boosted stories, or plain old false drops.

There are some other gotchas in news services as well. I am covering the problem of aliases, filtered content, and shallow back files in my upcoming ISS lecture in Washington, DC, on the 24th of September.

Without a differentiated system, I assume real journalists and many users will embrace a logo redesign. It is supposed to be working for Yahoo. Google, on the other hand, seems to be chugging along with its ad-based business model and announcing that it will bring sci-fi real time translation to the world.

One common thread unites these quite different companies: Body slam PR.

What happened to relevance?

Stephen E Arnold, September 17, 2013

How to Turn Around a Failing Company

September 17, 2013

Jonathan H. Lack has been an associate of ArnoldIT since 1996. His new monograph is Plan to Turn Your Company Around in 90 Days. We recommend this practical and pragmatic guide for managers struggling with shifting economic winds.

Mr. Lack said:

“Every company’s financial and operational situation, culture, and dynamics are different. However, the fundamentals of operating any business and the problems to which many companies are vulnerable are not that unique. This entire book is based on firsthand experience of helping different types of companies work through very similar problems.”

HighGainBlog said:

This book is written for businesses large and small as well as for CEOs, board members, and managers. Lack’s expertise comes from his role as principal for ROI Ventures, which specializes in turning companies around. He also has 20 years of experience in management and strategic planning. This expertise shines through as he offers sound advice ranging from effectively managing cash flow to managing staff. We highly recommend this book to drowning professionals looking for a lifeline, as well as to those interested in injecting new life into their business and gaining a few valuable insights along the way.

Plan to Turn Your Company Around in 90 Days is available for purchase online at Amazon.com or at Apress.com under ISBN13: 978-1-4302-4668-8. Order a copy if you are involved in search, content processing, and analytics. This industry sector faces increased cost of sales, long sales cycles, hard-to-control costs, and challenging revenue targets.

Stephen E Arnold, September 17, 2013

Bold Assertions about Big Data Security Threats

September 17, 2013

Big Data comes with its own slew of security problems, but could it actually be used to keep track of them? The idea of using Big Data to catch security threats is novel, and a bold claim to stand behind. PR Newswire lets us know that “AnubisNetworks’s Big Data Intelligence Platform Analyses Millions Of Cyber Security Threat Events.” AnubisNetworks is a well-known name in IT security risk management software and cloud solutions, and its newest product to combat cyber threats is StreamForce, a real-time intelligence platform that detects and analyzes millions of cyber security threat events per second.

StreamForce de-duplicates events to reduce the big data storage burden, one of the biggest challenges big data security faces.
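
The article does not say how StreamForce performs its de-duplication; a common approach is to fingerprint each event and drop repeats. The sketch below is illustrative only (the names are mine, not AnubisNetworks’ API):

```python
import hashlib
import json

def fingerprint(event: dict) -> str:
    """Stable hash of an event's content (key order normalized)."""
    canonical = json.dumps(event, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(events):
    """Yield each distinct event once, dropping repeats."""
    seen = set()
    for event in events:
        fp = fingerprint(event)
        if fp not in seen:
            seen.add(fp)
            yield event

events = [
    {"src": "10.0.0.5", "type": "port_scan"},
    {"type": "port_scan", "src": "10.0.0.5"},  # same event, different key order
    {"src": "10.0.0.9", "type": "malware_beacon"},
]
unique = list(deduplicate(events))
print(len(unique))  # 2
```

At millions of events per second a production system would replace the in-memory set with something bounded, such as a time-windowed cache or a Bloom filter, but the principle is the same.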

“Within the new “big-data” paradigm – the exponential growth, availability and use of information, both structured and unstructured – is presenting major challenges for organizations to understand both risks as well as seizing opportunities to optimize revenue. StreamForce goes to the core of dealing with the increasingly complex world of events, across a landscape of distinct and disperse networks, cloud based applications, social media, mobile devices and applications. StreamForce goes a step further than traditional “after-the event” analysis, offering real-time actionable intelligence for risk analysts and decision makers, enabling quick reaction, and even prediction of threats and opportunities.”

StreamForce is pitched as the ideal tool for banks, financial institutions, telecommunication companies, and government intelligence and defense agencies. Fast and powerful is what big data users need, but does StreamForce really live up to its claims? Security threats are hard to detect even for the most tested security software. Can a data feather duster really make the difference?

Whitney Grace, September 17, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Text Processing Made Simple

September 17, 2013

Nothing involving text seems simple: lines of words go on for miles, often with improper punctuation or none at all. Text needs to be cataloged, organized, and tagged, but no one really wants to do that task. That is why “TextBlob: Simplified Text Processing” was born. What exactly is TextBlob? Here is the description straight from TextBlob’s homepage:

“TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, and more.”

TextBlob is available for free download and has its own GitHub following. When it comes to installing the library, be aware that it relies on NLTK and pattern.en. Many of the features include: part-of-speech tagging, JSON serialization, word and phrase frequencies, n-grams, word inflection, tokenization, language translation and detection, noun phrase extraction, and sentiment analysis.
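
To give a rough feel for two of the listed features, here is a plain-Python sketch of word frequencies and n-grams using only the standard library. This is an illustration of the concepts, not TextBlob’s own code; TextBlob exposes this kind of functionality through its `TextBlob` object:

```python
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Count lowercase word occurrences, a crude stand-in for word counting."""
    return Counter(text.lower().split())

def ngrams(text: str, n: int = 2):
    """Return the list of n-grams as tuples of consecutive words."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "the crawl covers the open web"
freqs = word_frequencies(text)
print(freqs["the"])        # 2
print(ngrams(text, 2)[0])  # ('the', 'crawl')
```

The library handles the harder tasks, such as part-of-speech tagging and sentiment analysis, by building on NLTK and pattern.en, which is why those dependencies are required at install time.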

After you download TextBlob, the Web site offers a comprehensive quick start guide to help users implement the library and make the best use of it. Free libraries make the open source community go around and improve ease of use for all users. If you use TextBlob, be sure to share any libraries of your own.

Whitney Grace, September 17, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Bing Offers Users a New Product Search

September 17, 2013

Microsoft’s version of search tries to remain competitive with Google, but lately it has been in the shadow of Google Glass and practically everything else the search giant does. Bing, however, has made the news again, according to Search Engine Watch, with the headline “New Bing Product Search Launches.” The Bing team has decided to integrate shopping results into regular search results so users can see product features, specifications, reviews, and related products, and so Bing can make some more dough from those who pay to have their Web sites driven to the top.

Bing has also changed the dashboard to feature three columns that display the shopping results:

· “The larger column will contain the main search results with the familiar blue links.

· The second column contains the Snapshot information complete with image, overview information, reviews related searches and paid ads.

· The third column is the Bing Social Sidebar, when users are signed in. The Social Sidebar adds information from Facebook, Klout and other social networks to help searchers make decisions based on friend or industry-leader recommendations.”

Bing insists that its new search is not pay-to-play and that the results will not be skewed in favor of one Web site over another. Do we believe it? Who knows what goes on behind company doors? Paid ads will still appear in the new shopping results, and there is a new product ad option called Rich Captions that lets advertisers add a meta description to search results. And there is the new way to make money. The new product feature has not fully launched yet; Bing is still working out the bugs.

Whitney Grace, September 17, 2013
Follow more happenings at OpenSourceSearch

Trendy Publication Criticizes Redundant Programs

September 16, 2013

I don’t do much work in any government these days. Too old, I suppose. I also don’t have any interaction with trendy blogs and with-it thinkers. I have a couple of friends who are about 70, and we talk about topics other than technology.

I did read “US Government Blew $321 Million on Redundant IT Programs.” The main point is that, out of $82 billion in spending, the news service pointed out that $321 million is duplicative expense. I am not too good at math, but I think that $321 million represents less than one percent of the alleged $82 billion. My math skills are not what they used to be, but the figure works out to roughly 0.0039, or about 0.39 percent. In short, trivial, a rounding error maybe?
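
The back-of-envelope arithmetic can be checked in a couple of lines:

```python
duplicative = 321_000_000   # $321 million flagged as redundant
total = 82_000_000_000      # $82 billion in IT spending

share = duplicative / total
print(round(share, 6))        # 0.003915, the raw fraction
print(round(share * 100, 2))  # 0.39, as a percentage
```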

Based on my own experience accrued since I joined Halliburton NUS in 1972 working in the Washington, DC vineyards, my hunch is that the $321 million number is incorrect. If the number were correct, the US government and its elected overseers are doing an outstanding job. In fact, the sole source for the report is the US government itself which is giving itself a GSA style iPad award.

My question: “Is the Government Accountability Office study accurate?” I think digging into the US government’s methodology, the time period of the study, and the verification and validation process of the methods used is important. For example, how did the GAO reconcile the different terminology used for information technology acquisitions?

As it stands, the report from GAO and the article make it clear that the government is doing a better job of managing than I thought possible. In fact, if the GAO study is accurate, the US government has improved its management of procurement in the last decade. I find this excellent management big news. Harvard Business Review will be panting for an analysis of this achievement.

I don’t pant. I just sigh.

Most Fortune 1000 firms have five or more enterprise search systems. None of these work particularly well. Now that’s redundancy.

Stephen E Arnold, September 16, 2013

The Rivals Face Off In The Search Ring

September 16, 2013

Here we go again with Facebook and Google. The two big IT rivals have been vying for control of the Internet for years, and Yahoo Small Business Advisor informs us that another face off is coming in the article “Graph Search Vs. Google.” Facebook CEO Mark Zuckerberg has already changed the way people communicate, but now he wants to change how people search. Instead of relying on basic content results, like Google, Zuckerberg wants Facebook’s Graph Search to return results based on its users’ friends and their likes. Google CEO Larry Page does not think his company and Facebook need to be rivals, but users cannot help comparing the two, and the article lists some of the problems Graph Search faces.

There are “dirty likes,” which are likes a business earns not because it is genuinely liked but because of incentives it gives users. Also, Graph Search will not be helpful to users who have too few or too many friends, because the results could be too sparse or too broad. The usual privacy concerns are noted, and mobile search still has its limitations.

Here is another big factor that users will like:

“And here’s the thing: Google’s social network does not use ads, letting users see only what they want to see.  Since G+ users don’t face the same pressure that leads to “dirty likes,” their circles are more likely to reflect their own personal interests. So even though Facebook has a much larger user base than Google+, the latter gives users a more personal experience. Plus, the fact that a person can access Gmail, Drive, and YouTube, all on the same website, while also finding personalized search results thanks to G+, is nothing to sneeze at, either.”

I am not looking forward to the news feed for the next few months as Graph Search comes out of its infancy. The true comparisons can only begin at that time, but then so will the rants and raves.

Whitney Grace, September 16, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Top New Social Sites and Why Facebook is no Longer Hot

September 16, 2013

According to the recent MakeUseof.com article “Facebook Usage is Changing – So Which Online Social Activities are Growing?” Facebook is quickly becoming a thing of the past. Instead of utilizing social networks that are tied to their real-life identity, today’s teens are flocking towards other networks that allow them to use pseudonyms and avatars.

Some of the hot new social sites that are taking the world by storm are: Tumblr, Instagram, Snapchat, WhatsApp, and (of course) Twitter.

The article explains why Twitter was included in the list, and how it differentiates itself from Facebook:

“There’s some evidence that Twitter is becoming more popular, with usage among teens doubling in the past year. Twitter might seem a bit stuffy, like one of the established social networks, but it has much in common with some of the upstarts. Unlike Facebook, Twitter doesn’t demand your real name — you can use anything you like as your Twitter handle. You can then engage publicly with other people about topics of interest or set your Twitter account to private and have your tweets visible only to your friends, although most Twitter usage is public. You don’t even have to send your own tweets — you can follow other accounts and just view them.”

As with everything in technology, what’s cutting edge today is inevitably going to be old news tomorrow. With this in mind, it’s no surprise that Facebook is losing its momentum. I wonder if it, like Twitter, will find new ways to stay relevant.

Jasmine Ashton, September 16, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Academic Integrity Questioned Because of Forgotten Supplemental Note

September 16, 2013

The academic community is supposed to represent integrity, research, and knowledge. When a project goes awry, researchers can understandably get upset, because several things could be on the line: job, funding, tenure, and so on. To make the findings go the way they want, researchers may be tempted to falsify data. A recent post on Slashdot points to a questionable academic situation: “Request To Falsify Data Published In Chemistry Journal.” Is this a case where data was falsified? Read the original post:

“A note inadvertently left in the ‘supplemental information’ of a journal article appears to instruct a subordinate scientist to fabricate data. Quoting: ‘The first author of the article, “Synthesis, Structure, and Catalytic Studies of Palladium and Platinum Bis-Sulfoxide Complexes,” published online ahead of print in the American Chemical Society (ACS) journal Organometallics, is Emma E. Drinkel of the University of Zurich in Switzerland. The online version of the article includes a link to this supporting information file. The bottom of page 12 of the document contains this instruction: “Emma, please insert NMR data here! where are they? and for this compound, just make up an elemental analysis …” We are making no judgments here. We don’t know who wrote this, and some commenters have noted that “just make up” could be an awkward choice of words by a non-native speaker of English who intended to instruct his student to make up a sample and then conduct the elemental analysis. Other commenters aren’t buying it.'”

“Make up an elemental analysis…,” does that statement sound credible to you? Researchers are supposed to question and analyze every iota of data until there is nothing left to explore. Making something up only leads to false data and will cause future studies to be inaccurate. Is this how all academics are or is it just an isolated incident?

Whitney Grace, September 16, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search
