The Future of Journalism Linked to Content Management Systems
July 17, 2014
The article titled “Scoop: A Glimpse Into the NYTimes CMS” on the New York Times Blog discusses the importance of Content Management Systems (CMS) for the future of journalism. Recently, journalist Ezra Klein reportedly left The Washington Post for Vox Media largely for Vox’s preferable CMS. The NYT has its own CMS called Scoop, described in the article:
“…It is a system for managing content and publishing data so that other applications can render the content across our platforms. This separation of functions gives development teams at The Times the freedom to build solutions on top of that data independently, allowing us to move faster than if Scoop were one monolithic system. For example, our commenting platform and recommendations engine integrate with Scoop but remain separate applications.”
So it does seem that there is some reinventing of the wheel going on at the NYT. The article outlines the major changes that Scoop has undergone in the past few years, with live article editing that sounds like Google Docs, tagging, notifications, and simplified processes for the addition of photographs and multimedia. While there is some debate about where Scoop stands on the list of Content Management Systems, the Times certainly has invested in it for the long haul.
Chelsea Kerwin, July 17, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
Scrape the Content
July 17, 2014
What is artoo.js? It borrows its name from everyone’s favorite Star Wars droid that speaks in beeps. What it does is completely different. It is a piece of JavaScript code that runs in your browser’s console and provides you with scraping utilities. This is a fine example of what happens when a Star Wars fan combines their computer savvy with their entertainment preference.
While the developer’s geek cred is established, does this make it a good tool? Let us study the features: methods for downloading scraped data, Web crawls, scraping of any Web page, instruction downloads, and built-in jQuery. Not bad, but why use artoo.js?
“Using browsers as scraping platforms comes with a lot of advantages:
• Fast coding: You can prototype your code live thanks to JavaScript browsers’ REPL and peruse the DOM with tools specifically built for web development.
• No more authentication issues: No longer need to deploy clever solutions to enable your spiders to authenticate on the website you intent to scrape. You are already authenticated on your browser as a human being.
• Tools for non-devs: You can easily design tools for non-dev people. One could easily build an application with a UI on top of artoo.js. Moreover, it gives you the possibility to create bookmarklets on the fly to execute your personnal scripts.”
We are sold! It offers more features than the average scraper and it makes the job easier. This is the scrape utility you are looking for.
Whitney Grace, July 17, 2014
Webinar Offered on SharePoint Extranet
July 17, 2014
Webinars are a helpful and popular way to make sense of some of the most complicated issues facing SharePoint users and managers. PremierPoint Solutions is hosting one tomorrow to help make some sense of the extranet. Read more in the PR Web release, “Webinar: A Comprehensive SharePoint Extranet Solution.”
The article begins:
“‘Making SharePoint Extranet Collaboration and Management Secure, Easy and Affordable,’ a free one-hour webinar about a comprehensive solution for simplifying SharePoint extranet management, will be take place July 2 at 11 a.m. EDT. The session will include a question and answer period. Hosted by PremierPoint Solutions . . . the webinar will demonstrate a proven tool for making the extranet secure and easy, with affordable access and collaboration for business partners, venders, employees and clients from virtually any place in the world.”
For those that find this type of precise support to be helpful, Stephen E. Arnold’s SharePoint feed on his Web site ArnoldIT.com might also be worth keeping an eye on. Arnold lends a career’s worth of expertise to all elements of search, including SharePoint. His tips and tricks are valuable for end users and managers alike.
Emily Rae Aldridge, July 17, 2014
Xoogler Under Pressure: The Yahoo Soap Opera Renews for Another Quarter
July 16, 2014
When Chris Kitze and I started The Point (Top 5% of the Internet), we admired the Yahoo Directory. Our goal was much narrower than Yahoo’s. We focused on putting Web sites in the Point directory that met our criteria for family friendly and young student friendly sites. That was in 1993 or 1994. The site was a hit and we sold the company to CMGI, and the Point ended up at Lycos. That deal was pretty successful for me, and I learned three things in the wild and woolly, pre-crash Internet era 20 years ago.
First, selling ads was difficult. In the early days, there were no solid guidelines for how big an ad could be. Blinking and flashing were annoying, but there was no user backlash against these lame attempts to attract attention. Proving from log data who clicked and other details required scripts and machine resources to grind through the huge files our Sparcs happily pumped out. I learned that ads were indeed good money. But the 1993 Internet required our team to be the digital equivalent of Roman trireme rowers. I don’t recall much time off, and it was hard work.
Selling ads is hard work. The landscape is altered by the process. There’s no guarantee there’s gold in them thar riverbeds. Source: http://bit.ly/1wuH5ef
Second, advertisers were reluctant to pay up front, a problem Google solved with its “account” method. We were stupid. We sent an invoice, the usage data, and waited for the check to come in the mail. Basic lesson: collecting for any online service can be difficult. When times are tough, advertisers shuffle priorities and our invoices filtered to the bottom of the stack. Collections were painful.
Third, making pages in 1993 was a time consuming affair. We experimented with many technologies and toolkits, and in 1995 we even tested systems like the incredibly sluggish Cold Fusion. We learned that the best way to create Web pages in the early 90s was to code ‘em up, shake ‘em out, and let ‘em loose. I repeatedly asked myself, “Why did I agree to put resources into a family friendly online service?”
I read two “real” news stories this morning. Neither has been connected in the blog posts and news streams flowing into my Overflight service. Let me point to each and then offer a handful of observations. I would suggest you keep in mind the three factoids I learned from The Point (Top 5% of the Internet) start-up.
The first item is “Yahoo Misses In Q2 With Revenue Of $1.04B, EPS Of $0.37.” At a time when newspapers and magazines are gasping for oxygen, Yahoo seems to have no turbocharger to activate. Once Alibaba follows its dream, Yahoo has only its in hand properties and acquisition opportunities to produce another Klondike Gold Rush. The write up said:
Yahoo reported its second-quarter financial performance, including revenue (excluding traffic acquisition costs, or TAC) of $1.04 billion and non-GAAP earnings per share of $0.37. Revenue including TAC was $1.08. Analysts had expected the company to earn $0.38 on revenue of ex-TAC of $1.08 billion.
The quote to note about Yahoo earnings is:
The company stated in its release that revenue growth is its “top priority,” and that it is “not satisfied with [its] Q2 results” in that context.
The second report, “Microsoft to Surpass Yahoo in Global Digital Ad Market Share This Year,” presents some good news for Microsoft. True, the write up does not mention the impending layoffs or the dismal device market share that this former monopoly now has.
Unlike some “experts” I view information about online advertising with considerable skepticism. I don’t think the individual numbers presented as “facts” are important. What struck me as important is this statement:
Yahoo’s push to maintain its position as a top global ad seller will take another hit in 2014, according to new projections from eMarketer. Though Yahoo’s ad revenues will be back in the black this year, increasing its global digital ad revenues by 2.7% after a decline of 2.1% in 2013 to reach $3.53 billion, the company’s share of the $140.15 billion digital advertising market will fall from 2.86% to 2.52%.
Microsoft—believe it or not—appears to be doing better than Yahoo in the ad battle.
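The share figures in the eMarketer passage quoted above can be sanity-checked with a few lines of arithmetic. This is only a rough sketch in Python: the inputs are the rounded numbers from the quote, the variable names are my own, and the 2013 values are backed out under the assumption that the growth rate and both share percentages refer to the same revenue definition.

```python
# Figures quoted from the eMarketer projection (billions of US dollars).
yahoo_2014_revenue = 3.53
market_2014 = 140.15
yahoo_2014_growth = 0.027   # +2.7% year over year
yahoo_2013_share = 0.0286   # 2.86% share in 2013

# Share of the 2014 market; this should land near the quoted 2.52%.
yahoo_2014_share = yahoo_2014_revenue / market_2014

# Back out the 2013 figures implied by the rounded numbers (assumption:
# the growth rate and the shares use the same revenue definition).
yahoo_2013_revenue = yahoo_2014_revenue / (1 + yahoo_2014_growth)
market_2013 = yahoo_2013_revenue / yahoo_2013_share
market_growth = market_2014 / market_2013 - 1

print(f"Yahoo 2014 share: {yahoo_2014_share:.2%}")
print(f"Implied 2013 Yahoo revenue: ${yahoo_2013_revenue:.2f}B")
print(f"Implied market growth: {market_growth:.1%}")
```

On these rounded inputs, the overall market appears to grow several times faster than Yahoo’s ad revenue, which is why the share shrinks even though the revenue number goes up.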
The big point in my opinion is that Yahoo has racked up falling ad revenue and will continue to lose online advertising market share, not because other vendors like Microsoft are doing a bang up job. I seem to recall that the Xoogler running Yahoo saw only happy faces in the revenue a few months ago. Like IBM’s slowly arcing down numbers, Yahoo appears to be riding a fading wave.
Several observations:
- Xooglers do not automatically generate money. In fact, Google’s revenue comes from its magical online search results ad system. (Anyone remember GoTo.com and Overture?) I bet Yahoo does.
- Selling online advertising is as difficult today as it was in the era of The Point (Top 5% of the Internet). Google’s approach relies on advertisers who will deposit money to be spent, so some of the collection hassle is ameliorated.
- Yahoo has been in turn around mode for a long time. Maybe AOL and Yahoo should get married and produce fat, happy revenue.
Now about the Yahoo search system. I find the results less than satisfying. I can’t figure out how to look at Louisville-related news. I continue to have difficulty logging into my for-fee Yahoo mail account when I am out of the country. I suppose I am the Lone Ranger in my view of Yahoo. That’s okay, but I see the declines as due to more users than myself.
Stephen E Arnold, July 16, 2014
Search Vendors and SEO: None Is a Home Run Super Star
July 16, 2014
One of the ArnoldIT team located a link to a Web site analytic service called SavedWebHistory.org. From the site, it is possible to enter a URL and get some information, mostly without context, about a domain. Some of the numbers are confusing. I plugged in a number of enterprise search vendors’ domain names to see what the SavedWebHistory.org system would report. I have reproduced a table containing the field names and the values for Autonomy, BA Insight, Coveo, Endeca, Funnelback, Mindbreeze, Recommind, Smartlogic, SurfRay, and X1. This list includes some well known companies like Autonomy and Endeca and some companies with average visibility. I also included some lesser known search vendors. The idea was to generate a comparative table with data points pertinent to some of the companies I follow.
You can work through the table or run your own reports. Several points jumped out at me:
- In terms of search engine optimization, Autonomy appears to have its paws on more key words than any of the other vendors in my test sample
- Three vendors have little Alexa presence according to the data; namely, BA Insight, Endeca, and SurfRay. I find Endeca’s zero score an anomaly. I am not surprised at the inclusion of BA Insight and SurfRay.
- Funnelback has more educational backlinks and governmental backlinks than any other vendor in this sample. Perhaps Funnelback is aggressively pursuing these markets or the Australian government is linking aggressively to Funnelback? Funnelback is also the leader in page views, according to the report for this sample.
- The all important Google PageRank score gives Autonomy a seven rating. The vendor with the lowest PageRank score is SurfRay, a vendor that has an interesting financial and business history. Most search vendors achieve a respectable PageRank score of five. Two legal centric search systems garner a PageRank of six. Lawyers seem to have a gift of lingo approximating that of Autonomy.
I have a frozen Web site at www.arnoldit.com. The score for this site is comparable to the average search engine vendor in traffic and PageRank. I am not sure how valuable these SEO-centric reports are, but if you are a company looking for sales leads, it might be easier to buy Google AdWords than to try to figure out how to reach today’s Web surfer.
Metric | Autonomy | BA Insight | Coveo | Endeca | Funnelback | Mindbreeze | Recommind | Smartlogic | SurfRay | X1 |
Key Words in SERPs | 1056 | 0 | 201 | 0 | 143 | 41 | 76 | 126 | 62 | 355 |
Google PR | 7 | 5 | 5 | 6 | 6 | 5 | 6 | 5 | 4 | 6 |
Yandex CY | 150 | 10 | 10 | 10 | 10 | 0 | 10 | 40 | 10 | 20 |
Google indexed | 10.3 | 0 | 12.5 | 5.44 | 4.18 | 507 | 1.8 | 13.1 | 918 | 9.75 |
Quantcast rank | 37.362 | 0 | 0 | 0 | 104.859 | 0 | 706.805 | 0 | 0 | 625.031 |
Alexa Rank | 141.452 | 0% | 557.525 | 0 | 349.195 | 440.065 | 600.586 | 449.451 | 0 | 444.193 |
Alexa Traff:Search % | 35.50% | 0 | 26.30% | 20.90% | 0.30% | 0.80% | 18.70% | 32.90% | 16.70% | 9.80% |
Alexa Traff:TimeOnSite | 144 sec | 59 sec | 129 sec | 223 sec | 69 sec | 216 sec | 132 sec | 194 sec | 41 sec | 163 sec |
Alexa Traff:Bounce | 57.90% | 0% | 64.60% | 40.10% | 62.80% | 42.70% | 52% | 63.40% | 61.10% | 43.90% |
Alexa Traff:PageV/User | 2.4 | 1.7 | 1.8 | 3.9 | 4.9 | 2.3 | 220.00% | 3.3 | 1.3 | 2.5 |
External BackLinks | 102.66 | 11.991 | 34.022 | 43.486 | 1.212.796 | 4.037 | 28.666 | 5.277 | 4.28 | 25.913 |
Referring Domains | 5.186 | 267 | 893 | 1.45 | 545 | 302 | 919 | 461 | 241 | 2.17 |
Indexed URLs | 9.38 | 1.625 | 42.736 | 2.207 | 78.497 | 7.517 | 6.945 | 159.279 | 14.606 | 52.694 |
Referring IP addresses | 3.832 | 238 | 678 | 1.245 | 481 | 224 | 718 | 339 | 204 | 1.887 |
Referring SubNets | 3.298 | 230 | 613 | 1.127 | 446 | 201 | 651 | 315 | 186 | 1.647 |
Referring .edu Domains | 122 | 1 | 6 | 32 | 21 | 2 | 6 | 3 | 0 | 15 |
.edu Backlinks | 215 | 437 | 88 | 87 | 703.051 | 3 | 14 | 11 | 0 | 146 |
Referring .gov Domains | 5 | 0 | 1 | 1 | 29 | 0 | 0 | 1 | 0 | 3 |
.gov Backlinks | 19 | 0 | 10 | 3 | 23.209 | 0 | 0 | 1 | 0 | 16 |
Referring .edu Domains to main | 47 | 0 | 5 | 24 | 4 | 1 | 5 | 1 | 0 | 6 |
.edu Backlinks to main | 86 | 0 | 35 | 68 | 6 | 2 | 11 | 1 | 0 | 9 |
Referring .gov Domains to main | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
.gov Backlinks to main | 15 | 0 | 0 | 3 | 2 | 0 | 0 | 0 | 0 | 4 |
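For readers who want to work through the table rather than eyeball it, here is a minimal sketch in Python that parses a few of the pipe-delimited rows and reports the leading and trailing vendor for a metric. The function names are hypothetical, the leading “Metric” label in the header is added here so the parser has a name for the first column, and only rows with plain integer values are included because the separators in some of the other rows are ambiguous.

```python
# Three rows copied from the table; the "Metric" header cell is an
# addition made for parsing, not part of the original report.
TABLE = """\
Metric | Autonomy | BA Insight | Coveo | Endeca | Funnelback | Mindbreeze | Recommind | Smartlogic | SurfRay | X1
Key Words in SERPs | 1056 | 0 | 201 | 0 | 143 | 41 | 76 | 126 | 62 | 355
Google PR | 7 | 5 | 5 | 6 | 6 | 5 | 6 | 5 | 4 | 6
Referring .gov Domains | 5 | 0 | 1 | 1 | 29 | 0 | 0 | 1 | 0 | 3
"""


def parse_table(text: str) -> dict[str, dict[str, float]]:
    """Return {metric: {vendor: value}} from pipe-delimited rows."""
    lines = text.strip().splitlines()
    vendors = [cell.strip() for cell in lines[0].split("|")][1:]
    rows = {}
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        rows[cells[0]] = dict(zip(vendors, (float(c) for c in cells[1:])))
    return rows


def leader(rows: dict, metric: str) -> str:
    """Vendor with the highest value for the metric (first one wins ties)."""
    return max(rows[metric], key=rows[metric].get)


def laggard(rows: dict, metric: str) -> str:
    """Vendor with the lowest value for the metric (first one wins ties)."""
    return min(rows[metric], key=rows[metric].get)


rows = parse_table(TABLE)
for metric in rows:
    print(f"{metric}: top = {leader(rows, metric)}, bottom = {laggard(rows, metric)}")
```

The same approach should extend to the remaining rows once a decision is made about how to read values such as 1.212.796, which the report does not explain.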
Stephen E Arnold, July 16, 2014
TextTeaser Goes Open Source
July 16, 2014
If you are looking for an auto-summarization tool, TechCrunch says “Auto-Summarization Tool TextTeaser Relaunches As Open Source Code.” Jolo Balbin is the creator of TextTeaser and he added it to GitHub after experiencing scalability issues in the API. Balbin recoded the program and the process is now faster. Developers have two plan options: one is $12 for every 1,000 articles summarized, while the enterprise plan is $250/month and comes with a dedicated server to store the article source.
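To see where the two plans cross over, a quick back-of-the-envelope calculation helps. The sketch below is in Python and uses only the two prices mentioned above; the names are made up for illustration, and it assumes the per-article plan is billed pro rata rather than in whole blocks of 1,000 articles, which the write up does not specify.

```python
# Prices as reported; billing granularity is an assumption (pro rata).
PAY_AS_YOU_GO_USD_PER_1000 = 12.0
ENTERPRISE_USD_PER_MONTH = 250.0


def pay_as_you_go_cost(articles_per_month: int) -> float:
    """Monthly cost on the per-article plan for a given volume."""
    return articles_per_month / 1000 * PAY_AS_YOU_GO_USD_PER_1000


# Monthly volume at which both plans cost the same.
break_even_articles = ENTERPRISE_USD_PER_MONTH / PAY_AS_YOU_GO_USD_PER_1000 * 1000

print(f"Break-even volume: about {break_even_articles:,.0f} articles per month")
print(f"Cost of 10,000 articles on the per-article plan: ${pay_as_you_go_cost(10_000):.2f}")
```

Under that assumption, the enterprise plan only pays for itself on price alone at a bit over 20,000 articles a month; the dedicated server may matter more than the arithmetic.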
“ ‘In this TextTeaser, you can train your own summarizer,’ Balbin explains. ‘You can provide the category and source of the article that will be used to improve the quality of the summaries. In the future, users might also have the ability to provide what keyword is important and what is not.’ ”
TextTeaser is used in reader apps, such as Gist. Balbin hopes to optimize the program for medical, financial, and legal documents.
TextTeaser sounds like it makes reading faster. The code is a valuable tool. We will stay tuned to see how else it is used.
Whitney Grace, July 16, 2014
Yandex Starts Local Search App
July 16, 2014
Russia’s answer to Google is Yandex. Yandex is continuously searching for ways to boost its product offerings beyond basic search, and now it has created its own version of Yelp. TechCrunch says in the article “Yandex’s New Local Search App, Yandex.City, Goes Where No Yelp Has Gone Before” that Yandex wants to tap into the growing middle class who own smartphones. Popular US based local search apps have limited or zero presence in Russia and the local competition is very small. Yandex could easily have the monopoly on local app search for Russia, but it aims to take the app to Ukraine, Turkey, Kazakhstan, and Belarus as well.
What will Yandex’s new app do?
“Yandex.City will link up with Yandex’s extensive mapping operation, which includes not just maps but navigation features to get you there, to provide results based on a user’s specified location. Just as it is with Google, advertising is very much the bread and butter of Yandex’s business, and from what we understand it will use Yandex.City to push geo-targeted and search-related ads to users.”
Yandex.City will most likely use technology from KitLocate, the geolocation startup the Russian search engine purchased earlier in 2014.
This is a strategic business move for Yandex to compete with Google and other search engines, add more services, and increase its advertising business. If Yandex offers a better product than its foreign competitors then it will beat everyone out. People tend to trust a product that speaks their own language over foreign blasts.
Whitney Grace, July 16, 2014
An Enterprise Search Case Study Is Like Hunting for Digital Truffles
July 15, 2014
The job hunters, experts, and consultants in the LinkedIn enterprise search discussion groups have been looking for positive use cases related to enterprise search. Finding a success story that one can verify is similar to hunting truffles. Keep your eye on the pig, or the truffle will disappear.
I did come across one use case published in the Italian Journal of Library and Information Science. You can find it at http://bit.ly/1juFaWi. The title of the paper is “Using a Google Search Appliance (GSA) to search digital library collections: a case study of the INIS Collection Search.” The problem search system was BASISPlus, now a product marketed and mostly “frozen” by OpenText.
The original version of BASIS was created at Battelle Memorial Institute in the late 1960s. Battelle spun out BASIS and Information Dimensions was the result. In 1998, OpenText bought BASIS, and I don’t think there has been much modernization of the system in the last couple of decades.
Yep, that’s an old school mainframey type system. A colleague and I used BASIS when it was an Information Dimensions product to provide data management, reporting, and search functionality to a Bell Communications Research (a chunk of what was Bell Labs) system that was used by the seven Baby Bells for a number of years.
My team and I loved big iron and FORTRAN. We stuffed the IBM MVS TSO system with some tasty BASIS sausage in 1983.
Well, the use case explains that BASIS was not the solution today’s users required. The fix was to license the Google Search Appliance. You can get a taste of the GSA’s license and fail over cluster costs at www.gsaadvantage.gov. Prepare yourself for sticker shock.
Keep in mind that “positive” has a spectrum of meanings determined by the reader’s context. The solution is the Google Search Appliance. You know this product as a search toaster…sort of. The advantages and disadvantages section of the use case hammers on the good parts and tiptoes around the thorns.
Stephen E Arnold, July 15, 2014
Funnelback Demonstration: Australian Government Grants
July 15, 2014
I saw a link to what seems to be an implementation of the Funnelback search system. Some folks see Funnelback as an alternative to the Google Search Appliance (a comparison that eludes me) or Elasticsearch (a little closer to the mark in my opinion).
Navigate to http://www.australiacouncil.gov.au. Enter a query. I used the term “aboriginal.” The results demonstrate that Funnelback has implemented some features that I associate with the 1998-2001 version of Endeca and a Google style results list.
Here’s the variant of what Endeca called “Guided Navigation”:
Here’s the Google style results list:
For a discussion of how one can integrate Squiz Matrix (a content management system) with Funnelback, navigate to the Squizsuite discussion board.
Years ago, I learned that Funnelback was the product of a university/Australian government project. Funnelback popped out of its incubator program and became part of Squiz in 2009. Even though Squiz flies the open source flag, Funnelback is a commercial product.
My Overflight archive shows that when I provided a profile of Funnelback to Commonwealth Scientific and Industrial Research Organization, I received inputs from someone. When the profile was published, the wizard responsible for Funnelback complained that the profile did not reflect his view of the system.
Since that initial interaction with Funnelback and its resident wizard, I have kept the system on my back burner. Getting one’s ducks in a row can be helpful when a third party is writing about a search system for inclusion in a monograph about information retrieval.
My recommendation is to talk with licensees and then, if possible, use the system and run some tests. Accepting a statement that Funnelback is an alternative to the Google Search Appliance is a stretch based on my experience. Is Funnelback comparable to Elasticsearch? The answer is that Elasticsearch has about $100 million in venture funding, outfits that make Elasticsearch available as a cloud solution requiring less fiddling than an on premises solution, and developers coming out of the woodwork. See, for example, this Search Wizards Speak interview.
Marketing does not equal employee satisfaction with a search system. Testing and analysis are often useful, not the baloney generated by some of the wizards who advise potential licensees. One outfit is selling my work via Amazon without my permission, without a valid contract with me, and without sharing the fee for a report based on my work. When “wizards” run companies, caution is advised.
Stephen E Arnold, July 15, 2014
Hidden from Google: Interesting but Thin
July 15, 2014
I learned about the Web site Hidden from Google. You can check out the service and maybe submit some results that have disappeared. You may not know if the deletion or hiding of the document is a result of the European Right to Be Forgotten action, but if content disappears, this site could be a useful checkpoint.
Here’s what the service looks like as of 9:21 am Eastern on July 15, 2014.
According to the Web site:
The purpose of this site is to list all links which are being censored by search engines due to the recent ruling of “Right to be forgotten” in the EU. This list is a way of archiving the actions of censorship on the Internet. It is up to the reader to decide whether our liberties are being upheld or violated by the recent rulings by the EU.
I noticed that dear old BBC appeared in the list, along with a handful of media superstars and some Web sites unknown to me. The “unknown” censored search term is intriguing, but I was not too keen on poking around when I was not sure what I was seeking. Perhaps one of the fancy predictive search engines can provide the missing information or not.
When I clicked on the “source” link sometimes I got a story that seemed germane; for example, http://bbc.in/1xhjKyK linked to one of those tiresome banker misdeed stories. Others pointed to stories that did not seem negative; for example, a Guardian article that redirected to a story in Entrepreneur Magazine (http://bit.ly/1jukI7T). Teething pains, I presume, or my own search ineptness.
I did some clicking around and concluded that the service is interesting but lacks in depth content. I looked for references to the US health care Web sites. I am interested in tracking online access to RFPs, RFQs, and agreements with vendors. These contracts are fascinating because the contractors extend the investigative capabilities of certain US law enforcement entities. Since I first researched the RAC, MIC, and ZPIC contractors, among others, I have noticed that content has become increasingly difficult to find. Content I could pinpoint in 2009 and 2010 now eludes me. Of course, I may be the problem. There could be latency issues when spiders come crawling. There can be churn among the contractors maintaining Web sites. There can be many other issues, including a 21st century version of Adam Smith’s invisible hand. The paw might be connected to an outfit like Xerox or some other company providing services to these programs.
Several questions:
First, if the service depends on crowdsourcing, I am not sure how many of today’s expert searchers will know when a document has gone missing. Unless I had prior knowledge of a Medicare Integrity Contractor statement of work, how would I know I could not find it? Is this a flaw the site will be able to work around?
Second, I am not sure the folks who filled out Google’s form and sent in their proof want an archive of information that was to go into the waste basket. Is there some action a forgotten person will take when he or she learns he or she is remembered?
Third, the idea is a good one. What happens when Google makes it uncomfortable to provide access to data that Google has removed? Maybe Mother Google is toothless and addled with its newfound interest in Hollywood and fashionable Google Glass gizmos. On the other hand, Google has lots of attorneys in trailers not too far from where the engineers work.
Stephen E Arnold, July 15, 2014