Web Archives
May 4, 2018
Short honk: Here’s a list of Web archives. These services allow one to find pages from a defunct or unavailable Web site or page:
- Archive Today at http://archive.is/
- Internet Archive at http://archive.org/web/
- Perma.cc at https://perma.cc/ (which is a collection of permalinks)
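For those who prefer to script the hunt, here is a minimal sketch of building a lookup address for the Internet Archive's Wayback Machine. It assumes the familiar `https://web.archive.org/web/<timestamp>/<url>` pattern still applies; the function name and the validation rule are illustrative, not part of any official client.

```python
# Assumption: the Wayback Machine serves captures under this prefix,
# optionally followed by a numeric timestamp (YYYYMMDDhhmmss or a prefix of it).
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_lookup_url(page_url: str, timestamp: str = "") -> str:
    """Return a Wayback Machine address for page_url (hypothetical helper).

    With an empty timestamp the archive is left to choose a capture;
    with a timestamp such as "2018" it should pick one near that date.
    """
    if timestamp and not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits, e.g. 2018 or 20180504")
    parts = [WAYBACK_PREFIX]
    if timestamp:
        parts.append(timestamp)
    parts.append(page_url)
    return "/".join(parts)


if __name__ == "__main__":
    # No network access happens here; the function only builds a string.
    print(wayback_lookup_url("http://example.com/", "2018"))
```

Pasting the resulting address into a browser is the quickest way to check whether a capture exists for a page that has gone missing.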
The Beyond Search goose has learned to generate a PDF of information. Quite a bit of content “disappearing” is taking place. To cite one example: Try to locate the list of MIC, RAC, and ZPIC vendors once engaged in locating health care billing fraud and similar misunderstandings. Enjoy your hunt for these items of information.
The source article was “Force Archive Websites to Pick up Webpages with This Handy Tool.”
Stephen E Arnold, May 4, 2018
LucidWorks Has a Search App for That. What?
April 27, 2018
Is there life in enterprise search after many years of hype, razzle dazzle, and over the top marketing?
Maybe?
Lucidworks announced that they have a brand new search tool for enterprise business systems, says GlobeNewswire in the article, “Lucidworks Launches AI-Powered Site Search App For Enterprise.” The new application is dubbed Lucidworks Site Search and it is an easily configurable, embeddable site-based application.
Lucidworks Site Search uses workflows that optimize natural language processing and machine learning for users to personalize their search results. The application uses rich faceting and filtering to drill down for the most accurate results. Users will be able to access content and insights quicker than older applications.
The Lucidworks CEO said,
“‘Developing a website’s search with both a powerful backend and an elegant UI can be an arduous process. We’ve created Site Search to empower more teams to get site search apps done and out the door,’ explains Lucidworks CEO Will Hayes. ‘By increasing the usability through an applications-based approach, we’re able to bring Lucidworks’ operationalized AI to more customers.’”
We enjoy terms like “operationalize.” Do we understand these MBA inspired noun to verb arabesques? Not really.
Key word search is a useful utility. The new Lucidworks Site Search scans through every document, allows quick configuration, and has an attractive user interface. Elasticsearch does this as well.
We believe the future belongs to vendors with a more comprehensive next generation information access system. In short, more like Palantir Gotham or BAE NetReveal and less like the mainframe centric IBM Stairs approach.
Whitney Grace, April 27, 2018
Taking Time for Search Vendor Limerence
April 18, 2018
Life is a bit hectic. The Beyond Search and the DarkCyber teams are working on the US government hidden Web presentation scheduled this week. We also have final research underway for the two Telestrategies ISS CyberOSINT lectures. The first is a review of the DarkCyber approach to deanonymizing Surface Web and hidden Web chat. The second focuses on deanonymizing digital currency transactions. Both sessions provide attendees with best practices, commercial solutions, open source tools, and the standard checklists which are a feature of my LE and intel lectures.
However, one of my associates asked me if I knew what the word “limerence” meant. This individual is reasonably intelligent, but the bar for brains is pretty low here in rural Kentucky. I told the person, “I think it is psychobabble, but I am not sure.”
The fix was a quick Bing.com search. The wonky relevance of the Google was the reason for the shift to the once indomitable Microsoft.
Limerence, according to Bing’s summary of Wikipedia, means “a state of mind which results from a romantic attraction to another person typically including compulsive thoughts and fantasies and a desire to form or maintain a relationship and have one’s feelings reciprocated.”
Upon reflection, I decided that limerence can be liberated from the woozy world of psychologists, shrinks, and wielders of water witches.
Consider this usage in the marginalized world of enterprise search:
Limerence: The state of mind which causes a vendor of key word search to embrace any application or use case which can be stretched to trigger a license to the vendor’s “finding” system.
About That Google Question Answering: Books, Scholar, and Open Source at Its Talon Tips
April 17, 2018
Googzilla prides itself on consuming search queries. Answering those questions? That’s a matter for discussion. Note that here in Harrod’s Creek we understand that if Google does not point to an entity, Web site, or factoid—that entity, Web site, or factoid does not exist. Who knew that those in Harrod’s Creek were into epistemology?
However, Pagal Parrot found “10 Questions Even Google Can’t Answer.” Let us take a look at the write up’s exemplary 10 questions:
“1. Why does a round pizza come in a square box?
2. Why are boxing rings square?
3. What is Satan’s last name?
4. Why do we press harder on a remote control when we know the batteries are flat?
5. Why is Google not the most translated website?
6. Why do banks charge a fee on ‘insufficient funds’ when they know there is not enough?
7. Why is it that people say they ‘slept like a baby’ when babies wake up, like, every two hours?
8. Why do Baidu lead Google in China?
9. Do Atheist also swear by the Bible /Quran when they go to court?
10. Why do people get angry each time another passenger sits beside them in a seat?”
These questions also beg another question: Do people spend time trying to dumbfound Google? It appears that the answer is, “Folks do try to bedevil the GOOG.”
The article is mostly for giggles, but there are definitely more than 10 questions Google cannot answer. Here is one: When will Google answer questions with precision and recall balanced for relevance and “accuracy”? Would advertisers respond to the functionality?
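For readers who have not bumped into the terms, precision and recall have standard definitions in information retrieval. The sketch below is illustrative only: the document identifiers are invented, the function names are my own, and nothing here describes how Google scores its results.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, one common way to balance them."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical result list and hypothetical relevance judgments.
    returned = ["doc_a", "doc_b", "doc_c", "doc_d"]
    on_point = ["doc_a", "doc_c", "doc_e"]
    p, r = precision_recall(returned, on_point)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1_score(p, r):.2f}")
```

A system can push either number up at the expense of the other, which is why the question asks about balance and not about one score alone.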
Whitney Grace, April 17, 2018
Google Argues With Russia About Website Rankings
April 10, 2018
Amidst its employee petitions and the increasing concern about YouTube videos for children, Google is annoyed with Russia.
Google fiddled with its ranking algorithm to stop the dissemination of fake news and Russia believes it is biased against two of its news agencies. Reuters describes more of the argument in the story, “Google Seeks To Defuse Row With Russia Over Website Rankings.” Roskomnadzor called out Alphabet Inc. and its popular search engine Google, when it claimed that Google pushed Russian media sites Sputnik and Russia Today into lower search results.
Eric Schmidt claimed that Google would not be deleting those links; instead, they would be pushed lower in search results. Russia claimed Google discriminated against Russia Today and Sputnik, and said it would take action if necessary. Google responded:
“ ‘We’d like to inform you that by speaking about ranking of web-sources, including the websites of Russia Today and Sputnik, Dr. Eric Schmidt was referring to Google’s ongoing efforts to improve search quality,’ Google said in a letter posted on Roskomnadzor’s website… ‘We don’t change our algorithm to re-rank,’ it added. A Google spokeswoman confirmed the letter had been sent by the company but provided no further comment.”
Years ago Mr. Brin’s trip to space fizzled. Now the search giant is finding fault with a country known to use interesting methods to solve problems.
Whitney Grace, April 10, 2018
Mondeca: Another Semantic Search Option
April 9, 2018
Mondeca, based in France, has long been focused on indexing and taxonomy. Now they offer a search platform named, simply enough, Semantic Search. Here’s their description:
“Semantic search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. Augment your SolR or ElasticSearch capabilities; understand the intent, contextualize search results; search using business terms instead of keywords.”
A few details from the product page caught my eye. Let’s begin with the Search functionality; the page succinctly describes:
“Navigational search – quickly locate specific content or resource. Informational search – learn more about a specific subject. Compound term processing, concept search, fuzzy search, simple but smart search, controlled terms, full text or metadata, relevancy scoring. Takes care of language, spelling, accents, case. Boolean expressions, auto complete, suggestions. Disambiguated queries, suggests alternatives to the original query. Relevance feedback: modify the original query with additional terms. Contextualize by user profile, location, search activity and more.”
The software includes a GUI for visualizing the semantic data, and features word-processing tools like auto complete and a thesaurus. Results are annotated, with key terms highlighted, and filters provide significant refinement, complete with suggestions. Results can also be clustered by either statistics or semantic tags. A personalized dashboard and several options for sharing and publishing round out my list. See the product page for more details.
Established in 1999, Mondeca delivers pragmatic semantic solutions to clients in Europe and North America, and is proud to have developed their own, successful semantic methodology. The firm is based in Paris. Perhaps the next time our beloved leader, Stephen E Arnold, visits Paris, the company will make time to speak with him. Previous attempts to set up a meeting were for naught. Ah, France.
Cynthia Murrell, April 9, 2018
Video Search: Still a Challenge
April 6, 2018
As MIT Technology Review describes in its article, “The Next Big Step for AI? Understanding Video,” artificial intelligence still tends to have trouble correctly interpreting video. A recent slew of new jobs at YouTube (owned by Google) underscores this flaw—“YouTube is Hiring 10,000 People to Police Offensive Videos,” reports the New York Post. When it comes to objectionable content, algorithms just don’t get it. Yet. Meanwhile, the PR machine keeps running.
MIT Tech editor Will Knight discusses some promising solutions in the above article, beginning close to home with a collaboration between MIT and IBM. He writes:
“MIT and IBM this week released a vast data set of video clips painstakingly annotated with details of the action being carried out. The Moments in Time Dataset includes three-second snippets of everything from fishing to break-dancing. ‘A lot of things in the world change from one second to the next,’ says Aude Oliva, a principal research scientist at MIT and one of the people behind the project. ‘If you want to understand why something is happening, motion gives you lot of information that you cannot capture in a single frame.’” … “The MIT-IBM project is in fact just one of several video data sets designed to spur progress in training machines to understand actions in the physical world. Last year, for example, Google released a set of eight million tagged YouTube videos called YouTube-8M. Facebook is developing an annotated data set of video actions called the Scenes, Actions, and Objects set.”
Knight also mentions Twenty Billion Neurons, which, he notes:
“… Created a custom data set by paying crowdsourced workers to perform simple tasks. One of the company’s cofounders, Roland Memisevic, says it also uses a neural network designed specifically to process temporal vision information.”
So, we should not be surprised if, soon, AI can comprehend what it “sees.” Meanwhile, sites that host video content would do well to employ the judgment of humans.
Cynthia Murrell, April 6, 2018
Build an Alternative Google: How To Wanted
April 6, 2018
Hacker News presented an interesting question, “How would you build an internet scale web crawler?” We have been talking with companies which have developed Internet search systems that are not available for free Web search. Those conversations have produced some fascinating information. Some of the data will be included in my upcoming lecture for a government agency and then in my two presentations at the June 2018 Telestrategies ISS Conference in Prague.
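To give a sense of what the Hacker News question involves, here is a toy sketch of the core loop of a crawler: a frontier queue plus a set of already seen addresses. It is nowhere near Internet scale. The `fetch_links` callable and the in-memory `TOY_WEB` are stand-ins I made up for illustration; a real system would add distributed storage, politeness delays, robots.txt handling, and much more.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> List[str]:
    """Breadth-first crawl: visit each address once, in discovery order."""
    frontier = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            frontier.append(seed)

    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        # fetch_links is a hypothetical hook that returns a page's outbound links.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# A four page imaginary Web, so the example runs without network access.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

if __name__ == "__main__":
    print(crawl(["a"], lambda url: TOY_WEB.get(url, [])))
```

The hard parts of the real problem live outside this loop: keeping the seen set and the frontier consistent across many machines, and deciding which pages are worth revisiting.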
What was interesting about this question was that few people responded. That is interesting because my team’s research for my new presentations on deanonymizing encrypted chat and deanonymizing digital currency transactions pivots on comprehensive Internet indexing. In fact, more companies are indexing Internet content than at any time in the last 10 years.
The second issue the post triggered was a realization that only a handful of people jumped on the topic. This low response to the question in itself is interesting. With more activity in indexing, why aren’t more people helping out JustinGarrison? That’s a question worth thinking about.
Third, one of the responses to the Hacker News question was a pointer to the YaCy.net open source project. We once included this technology in our Internet Research for Law Enforcement training program. My recollection of the system is fuzzy, so I will get one of my team to take a look.
The final thought the Hacker News’ story triggered was, “Have people just accepted Bing, Google, Qwant, and a handful of metasearch systems as too dominant to challenge?” My view is that an opportunity exists to create a public facing Internet search and retrieval system. The reason? Outstanding alternatives to Bing, Google, and Qwant are available for those who qualify as customers and who are willing to pay the license fees.
My hunch is that just as enterprise search has coalesced around the open source Lucene/Solr technologies, free Web search has become “game over” because the ad supported model has won.
The problem, of course, is that a person looking for information usually does not realize that free Web search results are not comprehensive, timely, or objective.
I hope individuals like JustinGarrison get the information needed to seize an opportunity in Internet search.
Stephen E Arnold, April 6, 2018
Google and Search: More Churn Turmoil
April 4, 2018
I read “John Giannandrea, Head of Google’s Cornerstone Web-Search Unit, Steps Down.” I found the phrase “steps down” amusing. I think the wizard went to the Apple orchard. Since Mr. Giannandrea ran search, Google search has become less useful to me. Now I have to use multiple search systems to locate what I think are slam dunk queries. Nope. I get some pretty off the wall Google search results.
Two points jumped out of this story for me.
First, Google is forced to go back to one of the early Googlers from the AltaVista.com team. (I did some work for an outfit called PersimmonIT, which was a provider to AltaVista.com.) What’s interesting is that Jeff Dean is one of the really old Google guard. I know he’s bright and capable but that begs this question: “Aren’t there younger, smarter, and as or more capable professionals to get the over hyped Google artificial intelligence operation underway?” I can suggest at least one candidate from the DeepMind team. But, hey, who really cares?
Second, search must be pretty broken. The job has fallen to another old timer at the GOOG. Same question: “Aren’t there younger, more with it technical wizards who can handle the massively complex, software wrapped, advertising centric systems?” (Yep, systems because there is “regular” search and “mobile” search. Two search systems are part of the index puzzle Google has built over the years.) Plus, do you remember Google’s “universal” search which, as a Bear Stearns legend has it, was cooked up over a weekend to deal with a PR problem triggered by an analyst’s report to which yours truly contributed? You know “universal.” One query gets you blog content, new Web sites, Google Books, Google Scholar, yada yada. That doesn’t exist and probably will never come to pass for some pretty good reasons. But saying something is just as good as delivering I assume.
Net net: Google is now a mature company. The founders have distanced themselves from the legal troubles in which the company is mired. The company is caught in the Silicon Valley backlash. The Oracle Java thing is a Freddy Krueger thing for the GOOG. Management change is a companion to the craziness which seems to characterize some units of the company.
I wonder if a query launched from a desktop computer will return on point results in the near future. I sure hope so.
Stephen E Arnold, April 4, 2018
Hidden Webs May Be a Content Escape Hatch
March 28, 2018
Beyond Search and the DarkCyber research team discussed a topic which raised some concern among the team. Censorship may be nudging some individuals to the hidden Webs; for example, the Dark Web, i2p, ZeroWeb, etc.
In the wake of several US school shootings, the outcry for more control over gun sales has grown louder. Many organizations, such as YouTube, which recently removed all of its firearms content, have begun to distance themselves from firearms related topics. The response has created a strange subculture, as we discovered in this recent NPR story, “Restricted by YouTube, Gun Enthusiasts are Taking Their Videos to Pornhub.”
According to the story:
“InRangeTV, which has some 144,000 subscribers on its YouTube channel, has chosen to publish videos on an adult website called Pornhub…InRangeTV also recently wrote on Facebook that it is defending “Why are we seeing continuing restrictions and challenges towards content about something demonstrably legal yet not against that which is clearly illegal?” It then posted links to YouTube videos on synthesizing meth and other illicit acts.”
This is an odd place for a freedom of speech battle to take place, but not completely. It seems right in line with something Larry Flynt would have pursued. Conversely, as far right leaning content is going closer and closer toward the dark web (pornography is not the dark web, but it feels like that’s the direction this is heading) the dark web is beginning to try to take down YouTube with rightwing trolling at an extreme level. What all this means for average citizens is that search is going to get more complicated, no matter what you are hunting for.
We also noted that a site dedicated to off color content has become the new home for those who are interested in weaponry. We think the shift may be gaining momentum. How does one “find” these types of content? Perhaps encrypted chat or old fashioned word of mouth messaging. Worth watching this possible shift.
Patrick Roland, March 28, 2018