
Alphabet Google Misspells Relevance, Yikes, Yelp?

November 25, 2015

I read “Google Says Local Search Result That Buried Rivals Yelp, Trip Advisor Is Just a Bug.” I thought the relevance, precision, and objectivity issues had been zipped into a mummy-style sleeping bag and stashed in the deep freeze.

According to the write up:

executives from public Internet companies Yelp and TripAdvisor noted a disturbing trend: Google searches on smartphones for their businesses had suddenly buried their results beneath Google’s own. It looked like a flagrant reversal of Google’s stated position on search, and a move to edge out rivals.

The article contains this statement attributed to the big dog at Yelp:

Far from a glitch, this is a pattern of behavior by Google.

I don’t have a dog in this fight nor am I looking for a dog friendly hotel or a really great restaurant in Rooster Run, Kentucky.

My own experience running queries on Google is okay. Of course, I have the goslings, several of whom are real live expert searchers with library degrees; one has a couple of well-received books to her credit. Oh, I forgot. We also have a pipeline to a couple of high-profile library schools, and I have a Rolodex with the names and numbers of research professionals who have pretty good search skills.

So maybe my experience with Google is different from the folks who are not able to work around what the Yelp top dog calls, according to the article, “Google’s monopoly.”

My thought is that those looking for free search results need to understand how oddities like relevance, precision, and objectivity are defined at the Alphabet Google thing.

Google even published a chunky manual to help Web masters, who may have been failed middle school teachers in a previous job, do things the Alphabet Google way. You can find the rules of the Google information highway here.

The Google relevance, precision, and objectivity thing has many moving parts. Glitches are possible. Do Googlers make errors? In my experience, not too many. Well, maybe solving death, Glass, and finding like-minded folks in the European Union regulators’ office.

My suggestion? Think about other ways to obtain information. When a former Gannett sci-tech reporter could not find the Cuba Libre restaurant in DC on his Apple phone, there was an option. I took him there even though the eatery was not in the Google mobile search results. Cuba Libre is not too far from the Alphabet Google DC office. No problem.

Stephen E Arnold, November 25, 2015

Indexing: A Cautionary Example

November 17, 2015

I read “Half of World’s Museum Specimens Are Wrongly Labeled, Oxford University Finds.” Anyone involved in indexing knows the perils of assigning labels, tags, or what the whiz kids call metadata to an object.

Humans make mistakes. According to the write up:

As many as half of all natural history specimens held in some of the world’s greatest institutions are probably wrongly labeled, according to experts at Oxford University and the Royal Botanic Garden in Edinburgh. The confusion has arisen because even accomplished naturalists struggle to tell the difference between similar plants and insects. And with hundreds or thousands of specimens arriving at once, it can be too time-consuming to meticulously research each, and guesses have to be made.

Yikes. Only half. I know that human indexers get tired. Now there is just too much work to do. The reaction is typical of busy subject matter experts. Just guess. Close enough for horseshoes.

What about machine indexing? Anyone who has retrained an HP Autonomy system knows that humans get involved as well. If humans make mistakes with bugs and weeds, imagine what happens when a human has to figure out a blog post in a dialect of Korean.

The brutal reality is that indexing is a problem. When dealing with humans, the problems do not go away. When humans interact with automated systems, the automated systems make mistakes, often more rapidly than the sorry human indexing professionals do.

What’s the point?

I would sum up the implication as:

Do not believe a human (indexing species or marketer of automated indexing species).

Acceptable indexing with accuracy above 85 percent is very difficult to achieve. Unfortunately, the graduates of a taxonomy boot camp or the entrepreneur flogging an automatic indexing system powered by artificial intelligence may not be reliable sources of information.

I know that this notion of high error rates is disappointing to those who believe their whizzy new system works like a champ.

Reality is often painful, particularly when indexing is involved.

What are the consequences? Here are three:

  1. Results of queries are incomplete or just wrong
  2. Users are unaware of missing information
  3. Failure to maintain human, human-assisted, or automated systems results in indexing drift. Eventually the indexing is just misleading if not incorrect.

How accurate is your firm’s indexing? How accurate is your own indexing?
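
If you want a number instead of a shrug, spot check a sample of indexed records against a human-reviewed gold set. Here is a minimal sketch of that spot check; the record identifiers, tags, and sample values are my own illustrative assumptions, not anyone’s production workflow:

```python
# Minimal sketch: compare assigned index terms with a human-reviewed gold set.
# The document IDs, tags, and sample values are illustrative assumptions.

def tag_accuracy(assigned, gold):
    """Fraction of assigned tags confirmed by the reviewed gold set."""
    correct = total = 0
    for doc_id, tags in assigned.items():
        total += len(tags)
        correct += len(tags & gold.get(doc_id, set()))
    return correct / total if total else 0.0

assigned = {
    "specimen-1": {"botany", "ferns", "herbarium"},
    "specimen-2": {"beetles", "lepidoptera"},  # one confirmed tag, one guess
}
gold = {
    "specimen-1": {"botany", "ferns", "herbarium"},
    "specimen-2": {"beetles"},
}

print(f"tag accuracy: {tag_accuracy(assigned, gold):.0%}")  # 80 percent here: below the 85 percent bar
```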

Stephen E Arnold, November 17, 2015

An Early Computer-Assisted Concordance

November 17, 2015

An interesting post at Mashable, “1955: The Univac Bible,” takes us back in time to examine an innovative indexing project. Writer Chris Wild tells us about the preacher who realized that these newfangled “computers” might be able to help with a classically tedious and time-consuming task: compiling a book’s concordance, or alphabetical list of key words, their locations in the text, and the context in which each is used. Specifically, Rev. John Ellison and his team wanted to create the concordance for the recently completed Revised Standard Version of the Bible (also newfangled). Wild tells us how it was done:

“Five women spent five months transcribing the Bible’s approximately 800,000 words into binary code on magnetic tape. A second set of tapes was produced separately to weed out typing mistakes. It took Univac five hours to compare the two sets and ensure the accuracy of the transcription. The computer then spat out a list of all words, then a narrower list of key words. The biggest challenge was how to teach Univac to gather the right amount of context with each word. Bosgang spent 13 weeks composing the 1,800 instructions necessary to make it work. Once that was done, the concordance was alphabetized, and converted from binary code to readable type, producing a final 2,000-page book. All told, the computer shaved an estimated 23 years off the whole process.”
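
For readers who have never compiled one, the core of the job is easy to state in a few lines of modern code: list each key word, where it occurs, and a slice of surrounding context, then sort alphabetically. A minimal sketch follows; the sample verse, the stop word list, and the three-word context window are my assumptions, not details of the 1955 project:

```python
# Minimal sketch of a concordance: for each key word, record its position and
# a little surrounding context, then sort alphabetically. The sample text,
# stop words, and window size are illustrative assumptions.
import re
from collections import defaultdict

def build_concordance(text, stop_words=frozenset({"the", "and", "of", "in"}), window=3):
    words = re.findall(r"[A-Za-z']+", text.lower())
    concordance = defaultdict(list)
    for position, word in enumerate(words):
        if word in stop_words:
            continue  # keep key words only
        context = " ".join(words[max(0, position - window):position + window + 1])
        concordance[word].append((position, context))
    return dict(sorted(concordance.items()))  # alphabetical, like the printed volume

sample = "In the beginning God created the heaven and the earth."
for word, hits in build_concordance(sample).items():
    for position, context in hits:
        print(f"{word:<10} @ {position:<2} {context}")
```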

The article is worth checking out, both for more details on the project and for the historic photos. How much time would that job take now? It is good to remind ourselves that tagging and indexing data has only recently become a task that can be taken for granted.

Cynthia Murrell, November 17, 2015

Sponsored by, publisher of the CyberOSINT monograph


RAVN Pipeline Coupled with ElasticSearch to Improve Indexing Capabilities

October 28, 2015

The article on PR Newswire titled RAVN Systems Releases its Enterprise Search Indexing Platform, RAVN Pipeline, to Ingest Enterprise Content Into ElasticSearch unpacks the decision to improve the Elasticsearch platform by pairing it with RAVN’s indexing platform, the RAVN Pipeline. RAVN Systems is a UK company with expertise in processing unstructured data, founded by consultants and developers. Their stated goal is to discover new lands in the world of information technology. The article states,

“RAVN Pipeline delivers a platform approach to all your Extraction, Transformation and Load (ETL) needs. A wide variety of source repositories including, but not limited to, File systems, e-mail systems, DMS platforms, CRM systems and hosted platforms can be connected while maintaining document level security when indexing the content into Elasticsearch. Also, compressed archives and other complex data types are supported out of the box, with the ability to retain nested hierarchical structures.”

The added indexing ability is very important, especially for users trying to index from or into cloud-based repositories. Even a single instance of any type of data can be indexed with the Pipeline, which also enriches data during indexing with auto-tagging and classifications. The article also promises that non-specialists (by which I assume they mean people) will be able to use the new systems because they are GUI driven and intuitive.
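
RAVN’s connectors are proprietary, so treat the following only as a generic sketch of the last step the press release describes: pushing a document, with a few auto-assigned tags and its access control list, into Elasticsearch over the standard REST API. The index name, field names, and the toy tagging rule are my assumptions, not RAVN Pipeline code:

```python
# Generic sketch: index a document, with auto-assigned tags and an ACL field,
# into Elasticsearch via its REST API. Index name, fields, and the tagging
# rule are assumptions; this is not RAVN Pipeline code. Note that older
# Elasticsearch releases expect a mapping type in the URL path.
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local Elasticsearch node

def auto_tag(text):
    """Trivial stand-in for the Pipeline's enrichment step."""
    keywords = {"contract", "invoice", "email"}
    return sorted(k for k in keywords if k in text.lower())

def index_document(index, doc_id, title, body, acl):
    doc = {
        "title": title,
        "body": body,
        "tags": auto_tag(body),  # enrichment happens at index time
        "acl": acl,              # carry document-level security along
    }
    request = urllib.request.Request(
        f"{ES_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

print(index_document("enterprise", "42", "Supplier invoice",
                     "Invoice for services rendered under contract 7.",
                     ["finance-team"]))
```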

Chelsea Kerwin, October 28, 2015

Sponsored by, publisher of the CyberOSINT monograph


Short Honk: Crawl the Web at Scale

September 30, 2015

Short honk: I read “Aduana: Link Analysis to Crawl the Web at Scale.” The write up explains an open source project which can copy content “dispersed all over the Web.” Keep in mind that the approach focuses primarily on text. Aduana is a special back end for the developers’ tool for speeding up crawls; it is built on top of a data management system.

According to the write up:

we wanted to locate relevant pages first rather than on an ad hoc basis. We also wanted to revisit the more interesting ones more often than the others. We ultimately ran a pilot to see what happens. We figured our sheer capacity might be enough. After all, our cloud-based platform’s users scrape over two billion web pages per month….We think Aduana is a very promising tool to expedite broad crawls at scale. Using it, you can prioritize crawling pages with the specific type of information you’re after. It’s still experimental. And not production-ready yet.

In its present form, Aduana is able to:

  • Analyze news.
  • Search locations and people.
  • Perform sentiment analysis.
  • Find companies to classify them.
  • Extract job listings.
  • Find all sellers of certain products.

The write up contains links to the relevant GitHub information, some code snippets, and descriptive information.
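
The central idea, visiting “relevant pages first rather than on an ad hoc basis,” boils down to a priority queue over the crawl frontier. Here is a minimal sketch of that scheduling idea; the scoring function is a made-up stand-in for Aduana’s link analysis, not its actual algorithm:

```python
# Minimal sketch of a prioritized crawl frontier: score discovered links and
# always fetch the most promising one next. The scoring function is a made-up
# stand-in for Aduana's link analysis.
import heapq

class Frontier:
    def __init__(self):
        self._heap = []   # (negated score, url) so the best score pops first
        self._seen = set()

    def add(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

def relevance(url, anchor_text):
    """Toy signal: favor links whose anchor text mentions the topic of interest."""
    return anchor_text.lower().count("sentiment") + ("news" in url)

frontier = Frontier()
frontier.add("http://example.com/news/markets",
             relevance("http://example.com/news/markets", "market sentiment today"))
frontier.add("http://example.com/about",
             relevance("http://example.com/about", "about us"))

while (url := frontier.next_url()) is not None:
    print("fetch:", url)  # a real crawler would download the page and add newly discovered links here
```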

Stephen E Arnold, September 30, 2015

Google and Alta Vista: Who Remembers?

September 9, 2015

A lifetime ago, I did some work for an outfit called Persimmon IT. We fooled around with ways to take advantage of memory, which was a tricky devil in my salad days. The gizmos we used were manufactured by Digital Equipment. The processors were called “hot”, “complex”, and AXP. You may know this foot warmer as the Alpha. Persimmon operated out of an office in North Carolina. We bumped into wizards from Cambridge University (yep, that outfit again), engineers housed on the second floor of a usually warm office in Palo Alto, and individuals whom I never met but whose email I had to slog through.

So what?

A person forwarded me a link to what seems to be an aged write up called “Why Did Alta Vista Search Engine Lose Ground so Quickly to Google?” The write up was penned by a UCLA professor. I don’t have too much to say about the post. I was lucky to finish grade school. I missed the entire fourth and fifth grades because my Calvert Course instructor in Brazil died of yellow jaundice after my second lesson.

I scanned the write up; note that you may need to register in order to read the article and the comments thereto. I love walled gardens. They are so special.

I did notice that one reason Alta Vista went south was not mentioned. Due to the brilliant management of the company by Hewlett Packard/Compaq, Alta Vista created some unhappy campers. Few at HP knew about Persimmon, and none of these MBAs had the motivation to learn anything about the use of Alta Vista as a demonstration of the toasty Alpha chips, the clever use of lots of memory, and the speed with which certain content operations could be completed.

Unhappy with the state of affairs, the Palo Alto Alta Vista workers began to sniff for new opportunities. One scented candle burning in the information access night was a fledgling outfit called Google, formerly Backrub. Keep in mind that intermingling of wizards was and remains a standard operating procedure in Plastic Fantastic (my name for Sillycon Valley).

The baby Google benefited from HP’s outstanding management methods. The result was the decampment from the HP Way. If my memory serves me, the Google snagged Jeff Dean, Simon Tong, Monika Henzinger, and others. Keep in mind that I am no “real” academic, but my research revealed to me and to those who read my three monographs about Google that Google’s “speed” and “scaling” benefited significantly from the work of the Alta Vista folks.

I think this is important because few people in the search business pay much attention to the turbo boost HP unwittingly provided the Google.

In the comments to the “Why Did Alta Vista…” post, there were some other comments which I found stimulating.

  1. One commenter named Rajesh offered, “I do not remember the last time I searched for something and it did not end up in page 1.” My observation is, “Good for you.” Try this query and let me know how Google delivers on-point information: scram action. I did not see any hits to nuclear safety procedures. Did you, Rajesh? I assume your queries are different from mine. By the way, “scram local events” will produce a relevant hit halfway down the Google result page.
  2. Phillip observed that the “time stamp is irrelevant in this modern era, since sub second search is the norm.” I understand that “time” is not one of Google’s core competencies. Also, many results are returned from caches. The larger point is that Google remains time blind. Google invested in a company that does time well, but sophisticated temporal operations are out of reach for the Google.
  3. A number of commenting professionals emphasized that Google delivered clutter free, simple, clear results. Last time I looked at a Google results page for the query “katy perry,” the presentation was far from a tidy blue list of relevant results.
  4. Henry pointed out that the Alta Vista results were presented without logic. I recall that relevant results did appear when a query was appropriately formed.
  5. One comment pointed out that it was necessary to cut and paste results for the same query processed by multiple search engines. The individual reported that it took a half hour to do this manual work. I would point out that metasearch solutions became available in the early 1990s. Information is available here and here.

Enough of the walk down memory lane. Revisionism is alive and well. Little wonder that folks at Alphabet and other searchy type outfits continue to reinvent the wheel.

Isn’t a search app for a restaurant a “stored search”? Who cares? Very few.

Stephen E Arnold, September 9, 2015

Indexing Teen Messages?

September 7, 2015

If you are reading teens’ SMS messages, you may need a lexicon of young speak. The UK Department for Education has applied tax dollars to help you decode PAW and GNOC. The problem is that the write up does not provide a link to the word list. What is available is a link to Netlingo’s $20 list of Internet terms.


Maybe I am missing something in “P999: What Teenage Messages Really Mean?”

For a list of terms teens and the eternally young use, check out these free links:

I love it when “real journalists” do not follow the links about which they write. Some of these folks probably find turning on their turn signal too much work as well.

Stephen E Arnold, September 7, 2015

A Search Engine for College Students Purchasing Textbooks

August 27, 2015

The article on Lifehacker titled TUN’s Textbook Search Engine Compares Prices from Thousands of Sellers reviews TUN, or the “Textbook Save Engine.” It’s an ongoing issue for college students that tuition and fees are only the beginning of the expenses. Textbook costs alone can skyrocket for students who have no choice but to buy the assigned books if they want to pass their classes. TUN offers students all of the options available from thousands of booksellers. The article says,

“The “Textbook Save Engine” can search by ISBN, author, or title, and you can even use the service to sell textbooks as well. According to the main search page…students who have used the service have saved over 80% on average buying textbooks. That’s a lot of savings when you normally have to spend hundreds of dollars on books every semester… TUN’s textbook search engine even scours other sites for finding and buying cheap textbooks; like Amazon, Chegg, and Abe Books.”

After typing in the book title, you get a list of editions. For example, when I entered Pride and Prejudice, which I had to read for two separate English courses, TUN listed an annotated version, several versions with different forewords (which are occasionally studied in the classroom as well) and Pride and Prejudice and Zombies. After you select an edition, you are brought to the results, laid out with shipping and total prices. A handy tool for students who leave themselves enough time to order their books ahead of the beginning of the class.

Chelsea Kerwin, August 27, 2015

Sponsored by, publisher of the CyberOSINT monograph

Misinformation How To: You Can Build a False Identity

August 9, 2015

The Def Con hit talk is summarized in “Rush to Put Death Records Online Lets Anyone Be Killed.” The main idea is that one can fill out forms (inject) containing misinformation. Various “services” suck in the info and then make “smart” decisions about that information. Plug in the right combination of fake and shaped data, and a living human being can be declared “officially” dead. There are quite a few implications of this method, which is capturing the imaginations of real and would-be bad actors. Why swizzle through the so-called Dark Web when you can do it yourself? The question is, “How do search engines identify and filter this type of information?” Oh, right. The search engines do not. Quality and perceived accuracy are defined within the context of advertising and government grants.

Stephen E Arnold, August 9, 2015

Does America Want to Forget Some Items in the Google Index?

July 8, 2015

The idea that the Google sucks in data without much editorial control is just now grabbing brain cells in some folks. The Web indexing approach has traditionally allowed the crawlers to index what was available without too much latency. If there were servers which dropped a connection or returned an error, some Web crawlers would try again. Our Point crawler just kept on truckin’. I like the mantra, “Never go back.”

Google developed a more nuanced approach to Web indexing. The link thing, the popularity thing, and the hundred plus “factors” allowed the Google to figure out what to index, how often, and how deeply (no, grasshopper, not every page on a Web site is indexed with every crawl).

The notion of “right to be forgotten” amounts to a third party asking the GOOG to delete an index pointer in an index. This is sort of a hassle and can create some exciting moments for the programmers who have to manage the “forget me” function across distributed indexes and keep the eager beaver crawler from reindexing a content object.
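
One simple way to make a “forget me” request stick is a suppression list consulted in two places: the query front end filters the pointer out of results, and the crawler skips the URL on its next pass. A minimal sketch of that idea, under my assumptions rather than anything Google has disclosed:

```python
# Minimal sketch of a "forget me" suppression list checked by both the query
# front end and the crawler. An illustration of the idea, not Google's code.

forgotten = {"http://example.com/old-story"}  # URLs covered by removal requests

def filter_results(results):
    """Drop suppressed URLs before the result page is rendered."""
    return [hit for hit in results if hit["url"] not in forgotten]

def should_crawl(url):
    """Keep the eager beaver crawler from quietly reindexing a removed page."""
    return url not in forgotten

results = [
    {"url": "http://example.com/old-story", "title": "The item someone wants forgotten"},
    {"url": "http://example.com/new-story", "title": "Still in the index"},
]
print(filter_results(results))
print(should_crawl("http://example.com/old-story"))  # False: skip on the next pass
```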

The Google has to provide this type of third party editing for most of the requests from individuals who want one or more documents to be “forgotten”; that is, no longer in the Google index that public users’ queries “hit” for results.

I read “Google Is Facing a Fight over Americans’ Right to Be Forgotten.” The write up states:

Consumer Watchdog’s privacy project director John Simpson wrote to the FTC yesterday, complaining that though Google claims to be dedicated to user privacy, its reluctance to allow Americans to remove ‘irrelevant’ search results is “unfair and deceptive.”

I am not sure how quickly the various political bodies will move to make being forgotten a real thing. My hunch is that it will become an issue with legs. Down the road, the third party editing is likely to be required. The First Amendment is a hurdle, but when it comes time to fund a campaign or deal with winning an election, there may be some flexibility in third party editing’s appeal.

From my point of view, an index is an index. I have seen some frisky analyses of my blog articles and my for fee essays. I am not sure I want criticism of my work to be forgotten. Without an editorial policy, third party, ad hoc deletion of index pointers distorts the results as much as, if not more than, results skewed by advertisers’ personal charm.

How about an editorial policy and then the application of that policy so that results are within applicable guidelines and representative of the information available on the public Internet?

Wow, that sounds old fashioned. The notion of an editorial policy is often confused with information governance. Nope. Editorial policies inform the database user of the rules of the game and what is included and excluded from an online service.

I like dinosaurs too. Like a cloned brontosaurus, is it time to clone the notion of editorial policies for corpus indices?

Stephen E Arnold, July 8, 2015
