Why Enterprise Search Remains a Problem

August 8, 2020

I read “Let’s Build a Full-Text Search Engine.” The write up does a reasonable job of walking through the basics of building a search engine. The focus is full text search, but I think in terms of an organization and its content. As a result, the system summarized will not handle video, images, and other types of content. The code examples are clear, and I liked the straightforward approach.

However, there is a potential bump in the information superhighway. Here’s a Venn diagram from the article. Notice the work you have to do to find documents with small, wild cat?

[Image: Venn diagram from the article]

If I search for “smith”, “order”, “tile”, I want only the documents that contain all three terms; the Boolean AND should be applied by default. I want Smith’s orders for tile. Otherwise, I have to call the person. I don’t want to go on a scavenger hunt. (There are other minor nits too, but the AND’ing thing is huge to me.)
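What AND-by-default looks like can be shown with a toy inverted index. This is a minimal sketch, not the code from the article under discussion: the function names (`tokenize`, `build_index`, `search_and`) and the three sample documents are hypothetical, and there is no stemming, so “order” will not match “ordered”.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_and(index, query):
    """Return only the documents that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # .get() avoids creating empty entries for unknown terms in the defaultdict.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents, invented for illustration.
docs = {
    1: "Smith placed an order for tile",
    2: "Smith called about an order for carpet",
    3: "Jones ordered tile",
}
index = build_index(docs)
result = search_and(index, "smith order tile")
```

With these sample documents, only document 1 contains all three terms, so no scavenger hunt is required.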

Stephen E Arnold, August 6, 2020

Twitter: Another Almost Adult Moment

August 7, 2020

Indexing is useful. Twitter seems to be recognizing this fact. “Twitter to Label State-Controlled News Accounts” reports:

The company will also label the accounts of government-linked media, as well as “key government officials” from China, France, Russia, the UK and US. Russia’s RT and China’s Xinhua News will both be affected by the change. Twitter said it was acting to provide people with more context about what they see on the social network.

Long overdue, the idea of an explicit index term may allow some tweeters to get some help when trying to figure out where certain stories originate.

Twitter, a particularly corrosive social media system, has avoided adult actions. The firm’s security was characterized in a recent DarkCyber video as a clown car operation. No words were needed. The video showed a clown car.

Several questions from the DarkCyber team:

  1. When will Twitter verify user identities, thus eliminating sock puppet accounts? Developers of freeware manage this type of registration and verification process, not perfectly but certainly better than some other organizations.
  2. When will Twitter recognize that a tiny percentage of its tweeters account for the majority of the messages and implement a Twitch-like system to generate revenue from these individuals? Pay-per-use can be implemented in many ways; so can begging for dollars. Either way, Twitter gets an identification point which may have other functions.
  3. When will Twitter innovate? The service is valuable because a user or sock puppet can automate content regardless of its accuracy. Twitter has been the same for a number of Internet years. Dogs do age.

Is Twitter, for whatever reason, stuck in the management mentality of a high school science club which attracts good students, just not the whiz kids who are starting companies and working for Google-type outfits from their parents’ living rooms?

Stephen E Arnold, August 7, 2020

Fixing Language: No Problem

August 7, 2020

Many years ago I studied with a fellow who was the world’s expert on the morpheme -burger. Yep, hamburger, cheeseburger, dumbburger, nothingburger, and so on. He was Dr. Lev Sudek. (I think that was his last name, but after 50 years former teachers blur in my mind like a smidgen of mustard on a stupidburger.) I do recall his lecture on Indo-European languages, the importance of Sanskrit, and the complexity of Lithuanian nouns. (Why Lithuanian? Many, many inflections.) Those languages evolving or de-volving from Sanskrit or ur-Sanskrit differentiated among male, female, singular, neuter, plural, and others. I am thinking 16 for nouns, but again I am blurring the Sriracha on the Incredible burger.

This morning, as I wandered past the Memoryburger Restaurant, I spotted “These Are the Most Gender-Biased Languages in the World (Hint: English Has a Problem).” The write up points out that Carnegie Mellon analyzed languages and created a list of biased languages. What are the languages with an implicit problem regarding bias? Here is a list of the top 10 gender abusing, sexist pig languages:

  1. Danish
  2. German
  3. Norwegian
  4. Dutch
  5. Romanian
  6. English
  7. Hebrew
  8. Swedish
  9. Mandarin
  10. Persian

English is number 6, and if I understand Fast Company’s headline, English has a problem. Apparently Chinese and Persian do too, but the write up tiptoes around these linguistic land mines. Go with the Covid ridden, socially unstable, and financially stressed English speakers. Yes, ignore the Danes, the Germans, Norwegians, Dutch, and Romanians.

So what’s the fix for the offensive English speakers? The write up dodges this question, narrowing to algorithmic bias. I learned:

The implications are profound: This may partially explain where some early stereotypes about gender and work come from. Children as young as 2 exercise these biases, which cannot be explained by kids’ lived experiences (such as their own parents’ jobs, or seeing, say, many female nurses). The results could also be useful in combating algorithmic bias.

Profound indeed. But the French have a simple, logical, and “c’est top” solution: the Académie Française. This outfit is the reason why an American draws a sneer when asking where the computer store is in Nimes. The Académie Française does not want anyone trying to speak French to use a disgraced term like computer.

How’s that working out? Hashtag and Franglish are chugging right along. That means that legislating language is not getting much traction. You can read a 290-page dissertation about the dust-up. Check out “The Non Sexist Language Debate in French and English.” A real thriller.

The likelihood of enforcing specific language and usage changes on the 10 worst offenders strikes me as slim. Language changes, and I am not sure the morpheme -burger expert understood decades ago how politicallycorrectburgers could fit into an intellectual menu.

Stephen E Arnold, August 7, 2020

Kaggle ArXiv Dataset

August 7, 2020

“Leveraging ML to Fuel New Discoveries with the ArXiv Dataset” announces that more than 1.7 million journal-type papers are available without charge on Kaggle. DarkCyber learned:

To help make the ArXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable ArXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

What’s Kaggle? The article explains:

Kaggle is a destination for data scientists and machine learning engineers seeking interesting datasets, public notebooks, and competitions. Researchers can utilize Kaggle’s extensive data exploration tools and easily share their relevant scripts and output with others.

The ArXiv dataset contains metadata for each processed paper (document), including these fields:

  • ID: ArXiv ID (can be used to access the paper, see below)
  • Submitter: Who submitted the paper
  • Authors: Authors of the paper
  • Title: Title of the paper
  • Comments: Additional info, such as number of pages and figures
  • Journal-ref: Information about the journal the paper was published in
  • DOI: Digital Object Identifier (https://www.doi.org)
  • Abstract: The abstract of the paper
  • Categories: Categories / tags in the ArXiv system
  • Versions: A version history

Details about the data and their location appear at this link. You can use the ArXiv ID to download a paper.

What if you want to search the collection? You may want to download the terabyte-plus file and index the JSON using your favorite search utility. There’s a search system available from ArXiv, and you can use the site: operator on Bing or Google to see if one of those ad-supported services will point you to the document set you need.

DarkCyber wants to suggest that you download the corpus now (datasets can go missing) and use your favorite search and retrieval system or content processing system to locate and make sense of the ArXiv content objects.
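For those who go the do-it-yourself route, a first pass over the metadata might look like the sketch below. It rests on assumptions: that the metadata ships as one JSON object per line, and that the keys are lowercase versions of the field names listed above (`id`, `title`, `abstract`). The two sample records are invented for illustration and are not real papers; the function names are hypothetical. Matching is plain substring matching, which is crude but shows the idea.

```python
import json


def iter_records(lines):
    """Parse one JSON object per non-empty line (assumed JSON Lines layout)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def search_records(records, terms, fields=("title", "abstract")):
    """Yield records whose chosen fields contain every term (case-insensitive)."""
    wanted = [t.lower() for t in terms]
    for rec in records:
        haystack = " ".join(str(rec.get(f, "")) for f in fields).lower()
        if all(t in haystack for t in wanted):
            yield rec


def abstract_url(arxiv_id):
    """Build the arXiv abstract page URL from an ArXiv ID."""
    return f"https://arxiv.org/abs/{arxiv_id}"


# Invented sample lines standing in for the real metadata file.
SAMPLE_LINES = [
    '{"id": "0000.00001", "title": "Inverted Indexes for Text Retrieval", '
    '"abstract": "We study index compression.", "categories": "cs.IR"}',
    '{"id": "0000.00002", "title": "Galaxy Clusters", '
    '"abstract": "A survey of clusters.", "categories": "astro-ph"}',
]

hits = list(search_records(iter_records(SAMPLE_LINES), ["index", "retrieval"]))
urls = [abstract_url(rec["id"]) for rec in hits]
```

To run this against the real download, one would pass an open file handle to `iter_records` instead of `SAMPLE_LINES`. Because the function reads line by line, the whole file never has to fit in memory.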

Stephen E Arnold, August 7, 2020

Me Too, Me Too: Password Matching

August 7, 2020

Digital Shadows, founded in 2011, offers its Searchlight service. Terbium Labs, founded in 2013, offers its Matchlight services. Enzoic, founded in 2016, offers its password matching service. Scattered along the information highway are other cyber security firms offering variations on looking for compromised information on the Regular Web, the Dark Web, and in any other online source which the crawlers can reach. I mention these companies and their similar matching services because DarkCyber spotted “LogMeIn Introduces New Lastpass Security Dashboard and Dark Web Monitoring, Delivering a Complete Command Center for Managing Digital Security.” The write up states:

In addition to displaying weak and reused passwords, the new Security Dashboard now gives all LastPass users, regardless of tier, a full picture of their online security, providing complete control over their digital life and peace of mind that accounts are protected.

What’s interesting is that the capability to perform this type of LastPass check has been around for many years. Progress. People seeing the “light”? Some bad actors simply brute force passwords because many individuals prefer passwords from this list. The fact that strong passwords are not widely used contributes to bad actors’ success.
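The basic matching idea is not complicated. The sketch below is a minimal illustration, assuming a locally held list of leaked passwords; it is not how Digital Shadows, Terbium Labs, Enzoic, or LastPass implement their products, and the names `build_breach_set` and `is_compromised` are hypothetical. SHA-1 appears only so the lookup set does not hold plaintext; it is not a recommendation for storing passwords.

```python
import hashlib


def sha1_hex(password):
    """Return the uppercase hex SHA-1 digest of a password."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def build_breach_set(leaked_passwords):
    """Hash every leaked password once so lookups never touch plaintext."""
    return {sha1_hex(p) for p in leaked_passwords}


def is_compromised(password, breach_hashes):
    """True if the password's hash appears in the set of leaked hashes."""
    return sha1_hex(password) in breach_hashes


# A tiny stand-in for a leaked-password corpus (assumed for illustration).
COMMON = ["123456", "password", "qwerty"]
breach_hashes = build_breach_set(COMMON)

weak = is_compromised("password", breach_hashes)
strong = is_compromised("correct horse battery staple", breach_hashes)
```

A bad actor running a dictionary attack does the same lookup in reverse, which is why passwords from a well-known list fall so quickly.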

Stephen E Arnold, August 7, 2020

Google and the US: Winning Friends in China Not

August 7, 2020

DarkCyber spotted this weak beacon of adulting: “YouTube Bans Over 2,000 Chinese Accounts for Coordinated Influence Operations.” The write up states:

Between April and June this year, the company’s division responsible for combating government-backed attacks, Threat Analysis Group (TAG) took down about 2,600 YouTube accounts, significantly up from the 277 channels it blocked in the first three months of 2020. Most of these channels posted “spammy, non-political content”, Google said in a blog post, but some of them were actively participating in a spam network and uploaded political content primarily in Chinese.

Interesting. In an unrelated action DarkCyber wants to thank a reader for sending us a link to this story: “Pompeo Offers $10 Million Reward For Information On Foreign Election Interference.” The article reports:

In his latest speech excoriating China and the American tech industry for helping to enable untrustworthy Chinese companies by including their apps in various app stores, the Secretary of State warned Wednesday that the US was working to rein in Chinese cloud providers, while encouraging US tech firms to drop certain Chinese-run apps from their app stores. Pompeo also revealed the state department would offer $10 million reward for the identity or location of “any person who acting at the direction of a foreign government interferes with US. elections by engaging in certain criminal cyber activities.”

If these reports are accurate, Google and the US are unlikely to be perceived as positive factors in China’s effort to thrive globally.

Stephen E Arnold, August 7, 2020

The WhatsApp Information Warp: Small Worlds and Willful Blindness

August 6, 2020

WhatsApp is part of the new Facebook. Messaging, not email, is becoming the go-to way to handle many online tasks. Need to make a voice phone call? WhatsApp first to set up a time. Want to buy contraband? Consult a WhatsApp group populated with fellow WhatsAppers. Want to get accurate information? Ask a person whom one knows or consult members of a small world.

“WhatsApp Adds Web Search Feature to Help Users Debunk Misinformation” explains:

WhatsApp users in Ireland can now quickly check the contents of forwarded messages in a web search to help expose misinformation… The trial is WhatsApp’s latest attempt to stop the spread of misinformation on the platform after it introduced a limit to the number of times a message can be forwarded on earlier this year. The company confirmed that the new web search feature would begin rolling out today on both Android and iOS for users of the latest version of WhatsApp in Ireland, the UK, the US, Brazil, Italy, Mexico and Spain.

Helpful? Facebook is just another member of a WhatsApp user’s world, a very small world. The user has WhatsApp individuals in his or her circle of friends or contacts. Facebook is just in that circle, whether it consists of five or fifty individuals. Small worlds are a way of cutting out noise and trimming big knowledge tasks down to a more manageable size. [Note below] A small world may be a function of human intelligence and help explain why individuals prefer to interact in digital echo chambers. A participant in a small world operates in a conceptual space with fewer risks, surprises, and push backs. Stanford wizards explain that “short path lengths between nodes together with highly clustered link structures necessarily emerge for a wide set of parameters.”
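The two properties in the Stanford quote, short path lengths and high clustering, can be measured on a toy network. The sketch below is an illustration only, not the Stanford model: it assumes a ring of 12 nodes, each tied to its two nearest neighbors on either side, and then adds a single long-range shortcut. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def ring_lattice(n, k):
    """Build a ring where each node links to its k nearest neighbors per side."""
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            graph[i].add(j)
            graph[j].add(i)
    return graph


def average_clustering(graph):
    """Mean fraction of each node's neighbor pairs that are themselves linked."""
    total = 0.0
    for nbrs in graph.values():
        pairs = list(combinations(nbrs, 2))
        if pairs:
            linked = sum(1 for a, b in pairs if b in graph[a])
            total += linked / len(pairs)
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over reachable pairs, via BFS from each node."""
    total, count = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count


g = ring_lattice(12, 2)
clustering = average_clustering(g)      # tight local cliques: 0.5 for this ring
path_before = average_path_length(g)

# One long-range shortcut across the ring, like a contact outside the circle.
g[0].add(6)
g[6].add(0)
path_after = average_path_length(g)
```

The shortcut shortens the average path while the ring keeps its local cliques. That combination is the small world signature: tight circles plus a few links that make the whole network feel close.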

Small worlding may be a coping mechanism.

What happens when a widely used messaging service facilitates small worlds and then adds a workflow which defines what is and is not misinformation? The person in the small world, by definition, does not go looking for a broader context into which to plug an item of information. The WhatsApp user is likely to accept the designation provided by Facebook, which is the provider of the system, the context, and the signal about an item of information. Using an icon circumvents words. Over time, the WhatsApp user relies on the signal and the small world of friends and contacts to provide data, facts, ideas, and validation.

What users and possibly competitors and regulators may overlook is that WhatsApp does more than provide a handy messaging service. WhatsApp becomes a control mechanism either intentionally or unintentionally. Users, happy with the small world’s perceived value and functionality, become more satisfied with their small world. The small world is comfortable, predictable. Why question what one learns in a small world?

Why not? The WhatsApp small world is the digital equivalent of talking with friends and like minded individuals. Facebook, however, may not be a benign enabler and participant in a WhatsApp small world. Facebook can inject messages (advertising), shape content presented to clarify an issue, and herd members of many different small worlds toward a goal. Those in each small world do not see, cannot see, or choose to ignore a larger world.

WhatsApp warrants informed scrutiny because the small world phenomenon may put filter bubbles into a hypersonic chamber, accelerating molecules of thought to speeds unattainable outside of the WhatsApp machine. Determining what is and what is not valid information is a big play even for Facebook and WhatsApp in my opinion.

[Note] See also “Journalists’ Twitter Use Shows Them Talking within Smaller Bubbles”

Stephen E Arnold, August 6, 2020

An Experiment with OpenAI Text Generator

August 6, 2020

Blogger Manuel Araoz experiments with software once considered too dangerous to release in his post, “OpenAI’s GPT-3 May Be the Biggest Thing Since Bitcoin.” The quip about Bitcoin is a bit off the mark; blockchain has had a slow lift off. This OpenAI innovation, though, is another matter. It is the most adept AI yet at mimicking human writing, which means that bad actors, PR people, and SEO experts have a new tool with which to bedevil normal humans who operate via human brain power.

Most of Araoz’s article is, in fact, generated by the OpenAI beta algorithm, named GPT-3. See the post if you wish to read the software’s fictional tale of its own adventures in a Reddit forum. The reader is not informed until the end that the piece was generated by its own subject. I suspected it might be, mainly because it was redundant and a few passages were a bit awkward. However, I admit I may not have had those suspicions if I were reading about another subject entirely. At the end, Araoz shares the model he gave the AI as its starting point. He (I believe) writes:

“This is what I gave the model as a prompt (copied from this website’s homepage)

Manuel Araoz’s Personal Website

Bio

I studied Computer Science and Engineering at Instituto Tecnológico de Buenos Aires. I’m located in Buenos Aires, Argentina.

My previous work is mostly about cryptocurrencies, distributed systems, machine learning, interactivity, and robotics. One of my goals is to bring new experiences to people through technology.

I cofounded and was formerly CTO at OpenZeppelin. Currently, I’m studying music, biology+neuroscience, machine learning, and physics.

Blog

JUL 18, 2020

Title: OpenAI’s GPT-3 may be the biggest thing since bitcoin

Tags: tech, machine-learning, hacking

Summary: I share my early experiments with OpenAI’s new language prediction model (GPT-3) beta. I explain why I think GPT-3 has disruptive potential comparable to that of blockchain technology.

Full text:

and then just copied what the model generated verbatim with minor spacing and formatting edits (no other characters were changed). I generated different results a couple (less than 10) times until I felt the writing style somewhat matched my own, and published it. I also added the cover image. Hope you were as surprised as I was with the quality of the result.”

Not really, since we have been following along, but the results are convincing. The author has posted more of his experiments on Twitter, and is excited to work more with GPT-3. “Very strange times lie ahead,” he concludes. We agree.

Cynthia Murrell, August 6, 2020

Is America Losing its Innovation Edge?

August 6, 2020

Google borrowing money, dwindling university funding, a trend toward working from home, and a less than welcoming immigration policy combine to spell trouble for the United States. The Atlantic tells us why in, “America’s Innovation Engine is Slowing.” Writer and R Street resident fellow Caleb Watney begins by describing a recent event that contributed to doubt in education: ICE’s announcement that, should universities switch to online-only classes this fall, their international students would be booted from the country. The agency has since backed down from that stance, but not before it unnerved many current and potential students who might have contributed much to our nation. Watney writes:

“The visa debacle was only the latest of many ominous signs for the United States, long the world’s primary incubator of new technologies, new drugs, new therapies, and new business models. The coronavirus pandemic and the administration’s botched response to it are damaging the engine of American innovation in three major ways: The flow of talented people from overseas is slowing; the university hubs that produce basic research and development are in financial turmoil; and the circulation of people and ideas in high-productivity industrial clusters, such as Silicon Valley, has been impeded.”

We noted:

“All three trends started before the coronavirus arrived, but the pandemic has accelerated them in ways that, if left unaddressed, could cripple the U.S. economy for decades. During the difficult economic recovery from COVID-19, closed businesses will be able to reopen and rehire their furloughed workers, and delayed investments will resume. But if the nation’s capacity for economic and technological innovation is diminished, Americans will feel the loss for decades to come—not just in lower GDP but in slower progress toward a vaccine for COVID-19, solutions to climate change, a cure for cancer, and more.”

Bright students from other countries have sought education in the U.S. over the last century, and many have settled here. These folks have made considerable contributions to our country’s progress and economy. Universities, long the sources of academic research that has propelled us forward, will be forced to shutter their labs as funding continues to dwindle. Finally, the trend toward working from home is pulling people away from central meeting places where great ideas are shared and evolve, often during “off” hours. As we are finding out, virtual meetups just aren’t the same. The article covers each of these three factors in some depth, so navigate there for more details.

Watney paints a pretty bleak picture, but ends with a little hope—we have the power to invert these trends, if we choose. First, he advises, we should reverse the current administration’s immigration freeze to communicate to the rest of the world we welcome their inquiring minds. We should also inject a huge chunk of change into university research labs. Packed central meeting places cannot return until we’ve beaten the pandemic, of course. Once that is done, though, we could set the stage for collaboration by building affordable housing in dense cities, Watney suggests. Will we do what it takes to correct our course and avoid falling behind the rest of the world? We shall see.

Cynthia Murrell, August 6, 2020

Facebook and Google Get the Scoop in Australia

August 6, 2020

I read “Forcing Tech Giants to the Table.” The write up explains how the pay-Australian-publishers scheme will function. The article quoted Australian Treasurer Josh Frydenberg making the framework crystal clear:

We want Google and Facebook to continue to provide these services to the Australian community, which are so much loved and used by Australians. But we want it to be on our terms.

Those high school science club managers are not likely to find the phrase “on our terms” what is required to sit at the physicists’ and mathematicians’ table in the cafeteria.

The services required to deliver cash are summarized this way:

The range of Facebook services subject to arbitration includes Facebook News Feed, Instagram and the Facebook News Tab. The Google services are Google Search, Google News and Google Discover.

That defeats the whole purpose of the “free” services Google provides. On the other hand, if Google does pay for news in an above board manner, maybe the online ad giant can run sponsored messages, really tasteful ads, and present news in a logical order determined by black box algorithmic magic?

The write up adds:

A breach of the code by Facebook or Google could have a few potential outcomes. The first is an infringement notice which has a penalty of $A133,200 for each breach. If the ACCC takes one of the tech giants to court, the maximum penalty is the higher of $A10 million, 10% of the digital platform’s turnover in Australia in the past 12 months, or three times the benefit obtained by the tech giant as a result of the breach (if this can be calculated).
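The arithmetic in that passage is a simple “whichever is highest” rule. The sketch below works through it using only the figures quoted above; the function names and the turnover and benefit inputs are hypothetical, and the benefit term is optional because the passage says it applies only if it can be calculated.

```python
INFRINGEMENT_NOTICE_AUD = 133_200   # per breach, as quoted in the write up
FLOOR_AUD = 10_000_000              # the flat $A10 million figure


def infringement_total(breaches):
    """Total for infringement notices at the quoted per-breach amount."""
    return INFRINGEMENT_NOTICE_AUD * breaches


def max_court_penalty(turnover_aud, benefit_aud=None):
    """Highest of: $A10 million, 10% of turnover, or 3x the benefit (if known)."""
    candidates = [FLOOR_AUD, turnover_aud / 10]
    if benefit_aud is not None:
        candidates.append(3 * benefit_aud)
    return max(candidates)


# Hypothetical inputs chosen only to show each branch of the rule.
small_platform = max_court_penalty(50_000_000)                 # flat floor wins
big_platform = max_court_penalty(4_000_000_000)                # 10% of turnover wins
benefit_case = max_court_penalty(50_000_000, benefit_aud=4_000_000)  # 3x benefit wins
notices = infringement_total(5)
```

For a platform with large Australian turnover, the 10% branch dominates, which is where the real money sits.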

Net net: The science club crowd is likely to pout and be forced to fork out real money to legal eagles. These advisers will say, “This Australian thing will not fly.”

In the meantime, Facebook and Google will keep on doing stuff like selling ads, buying market share, and innovating to solve problems like death.

Stephen E Arnold, August 6, 2020
