First, California passed major privacy legislation in June. Then in late September, the Trump administration published official principles for a single national privacy standard. Not to be left out, House Democrats previewed their own Internet “Bill of Rights” earlier this month.
Sweeping privacy regulations, in short, are likely coming to the United States. That should be welcome news, given the sad, arguably nonexistent state of our modern right to privacy. But there are serious dangers in any new move to regulate data. Such regulations could backfire — for example, by entrenching already dominant technology companies or by failing to help consumers actually control the data we generate (presumably the major goal of any new legislation).
That’s where Brent Ozar comes in.
Ozar runs a small technology consulting company in California that provides training and troubleshooting for a database management system called Microsoft SQL Server. With a team of four people, Ozar’s company is by all accounts modest in scope, but it has a small international client base. Or at least it did, until European regulators in May began to enforce a privacy law called the General Data Protection Regulation (GDPR), which can carry fines of up to 4% of global revenue.
A few months before the GDPR began to be enforced, Ozar announced that it had forced his company to, in his words, “stop selling stuff to Europe.” As a consumer, Ozar wrote, he loved the regulations; but as a business, he simply couldn’t afford the costs of compliance or the risks of getting it wrong.
And Ozar wasn’t alone. Even larger international organizations like the Los Angeles Times and the Chicago Tribune — along with over 1,000 other news outlets — simply blocked any user accessing their sites with a European IP address rather than confront the costs of the GDPR.
So why should this story play a central role in the push to enact new privacy regulations here in the United States?
Because Ozar illustrates how privacy regulations come with huge costs. Privacy laws are, from one perspective, a transaction cost imposed on all our interactions with digital technologies. Sometimes those costs are minimal. But sometimes those costs can be prohibitive.
Privacy regulations, in short, can be dangerous.
So how can we minimize these dangers?
First, as regulators become more serious about enacting new privacy laws in the United States, they will be tempted to implement generic, broad-based regulations rather than to enshrine specific prescriptions in law. In the fast-moving world of technology, it’s always easier to write general rules than explicit recommendations, but regulators should avoid this temptation wherever possible.
Overly broad regulations that treat all organizations equally can end up encouraging “data monopolies” — where only a few companies can make use of all our data. Some organizations will have the resources to comply with complex, highly ambiguous laws; others (like Ozar’s) will not.
This means that the regulatory burden on data should be tiered so that the costs of compliance are not equal across unequal organizations. California’s Consumer Privacy Act confronts this problem directly by exempting many smaller organizations. The costs of compliance for any new regulation must not give additional advantages to the already-dominant tech companies of the world.
Second, and relatedly, a few organizations are increasingly in charge of much of our data, which presents a huge danger both to our privacy and to technological innovation. Any new privacy regulation must actively incentivize smaller organizations to share or pool data so that they can compete with larger data-driven organizations.
One possible solution to this problem is to encourage the use of what are called privacy-enhancing technologies, or PETs, such as differential privacy, homomorphic encryption, federated learning, and more. PETs, long championed by privacy advocates, help balance the tradeoff between the utility of data on the one hand and its privacy and security on the other.
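To make one of these techniques concrete, here is a minimal sketch of differential privacy applied to a simple counting query. The dataset, predicate, and epsilon value are invented for illustration; real deployments use hardened libraries rather than hand-rolled noise:

```python
import math
import random

def laplace_sample(scale):
    """Draw from a Laplace(0, scale) distribution via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon=0.5):
    """Return a noisy count satisfying epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the true count by at most 1), so Laplace noise with scale
    1/epsilon is enough to mask any individual's presence.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_sample(1.0 / epsilon)

# Hypothetical data: ages of eight people in a survey.
ages = [23, 35, 41, 29, 52, 38, 61, 27]
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
```

The analyst learns roughly how many respondents are over 40, but the added noise means no single person’s answer can be confidently inferred from the result — the utility/privacy tradeoff in miniature, tuned by epsilon.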
Last, user consent — the idea of users actively consenting to the collection of their data at a given point in time — can no longer play a central role in protecting our privacy. This has long been a dominant aspect of major privacy frameworks (think of all the “I Accept” buttons you’ve clicked to enter a website). But in the age of big data and machine learning, we simply cannot know the value of the information we give up at the point of collection.
The entire value of machine learning lies in its ability to detect patterns at scale. At any given time, the cost to our privacy of giving up small amounts of data is minimal; over time, however, that cost can become enormous. The famous case of Target knowing a teenager was pregnant before her family did, based simply on her shopping habits, is one among many such examples.
As a result, we cannot assume that we are ever fully informed about the privacy we’re giving up at any single point in time. Consumers must be able to exercise rights over their data long after it’s been collected, and those rights should include restricting how it’s being used.
Unless our laws can adapt to new digital technologies correctly — unless they can calibrate the balance between the cost of the compliance burden and the value of the privacy rights they seek to uphold — we run some very real risks. We can all too easily implement new laws that fail to preserve our privacy while also hindering the use of new technology.
The ability to understand and communicate about data is an increasingly important skill for the 21st-century citizen, for three reasons. First, data science and AI are affecting many industries globally, from healthcare and government to agriculture and finance. Second, much of the news is reported through the lenses of data and predictive models. And third, so much of our personal data is being used to define how we interact with the world.
When so much data is informing decisions across so many industries, you need to have a basic understanding of the data ecosystem in order to be part of the conversation. On top of this, the industry that you work in will more likely than not see the impact of data analytics. Even if you yourself don’t work directly with data, having this form of literacy will allow you to ask the right questions and be part of the conversation at work.
To take just one striking example, imagine if there had been a broader discussion about how to interpret probabilistic models in the run-up to the 2016 U.S. presidential election. FiveThirtyEight, the data journalism publication, gave Clinton a 71.4% chance of winning and Trump a 28.6% chance. As Allen Downey, Professor of Computer Science at Olin College, points out, fewer people would have been shocked by the result had they been reminded that, according to FiveThirtyEight’s model, a Trump win was slightly more likely than flipping two coins and getting two heads – hardly something that’s impossible to imagine.
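Downey’s point can be checked with a few lines of arithmetic and a quick simulation. The 28.6% figure comes from the article; the simulation itself is just an illustration of what a probability of that size feels like:

```python
import random

p_two_heads = 0.5 * 0.5   # probability of two heads on two fair coin flips
p_trump = 0.286           # FiveThirtyEight's final Trump win probability

# A 28.6% event is *more* likely than flipping two heads in a row (25%).
assert p_trump > p_two_heads

# Simulate 100,000 hypothetical "elections": an outcome with probability
# 0.286 happens routinely, not almost never.
random.seed(42)
wins = sum(random.random() < p_trump for _ in range(100_000))
share = wins / 100_000
```

Run enough simulated elections and Trump wins in roughly 29 of every 100 — a long shot only in the way that rolling a six is a long shot.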
What we talk about when we talk about data
The data-related concepts non-technical people need to understand fall into five buckets: (i) data generation, collection and storage, (ii) what data looks and feels like to data scientists and analysts, (iii) statistics intuition and common statistical pitfalls, (iv) model building, machine learning and AI, and (v) the ethics of data, big and small.
The first four buckets roughly correspond to key steps in the data science hierarchy of needs, as recently proposed by Monica Rogati. I have added data ethics as a fifth key concept because, although it has not yet been formally incorporated into data science workflows, ethics needs to be part of any conversation about data. So many people’s lives, after all, are increasingly affected by the data they produce and the algorithms that use them. This article will focus on the first two; I’ll leave the other three for a future article.
How data is generated, collected and stored
Every time you engage with the Internet, whether via web browser or mobile app, your activity is detected and most often stored. To get a feel for some of what your basic web browser can detect, check out Clickclickclick.click, a project that opens a window into the extent of passive data collection online. If you are more adventurous, you can install data selfie, which “collect[s] the same information you provide to Facebook, while still respecting your privacy.”
The collection of data isn’t limited to the world of laptop, smartphone and tablet interactions; it extends to the far wider Internet of Things (IoT), a catch-all term for traditionally dumb objects, such as radios and lights, that can be “smartified” by connecting them to the Internet, along with other data-collecting devices, such as fitness trackers, Amazon Echo and self-driving cars.
All the collected data is stored in what we colloquially refer to as “the cloud,” and it’s important to clarify what’s meant by this term. First, data in cloud storage exists in physical space, just like data on a computer or an external hard drive. The difference for the user is that the space is elsewhere, generally on server farms and data centers owned and operated by multinationals, and you usually access it over the Internet. Cloud storage comes in two types, public and private. Public cloud providers such as Amazon, Microsoft and Google are responsible for data management and maintenance, whereas responsibility for data in a private cloud remains with the company itself. Facebook, for example, has its own private cloud.
It is essential to recognize that cloud services store data in physical space, and that data may be subject to the laws of the country where it is located. This year’s General Data Protection Regulation (GDPR) in the EU, for example, reshapes user privacy and consent around personal data. Another pressing question is security: we need a more public and comprehensible conversation about data security in the cloud.
The feel of data
Data scientists mostly encounter data in one of three forms: (i) tabular data (that is, data in a table, like a spreadsheet), (ii) image data or (iii) unstructured data, such as natural language text or html code, which makes up the majority of the world’s data.
Tabular data. The most common type for a data scientist to use is tabular data, which is analogous to a spreadsheet. In Robert Chang’s article on “Using Machine Learning to Predict Value of Homes On Airbnb,” he shows a sample of the data, which appears in a table in which each row is a particular property and each column a particular feature of properties, such as host city, average nightly price and 1-year revenue. (Note that data rarely arrives from users already in tabular form; data engineering is an essential step to make it ready for such an analysis.)
Such data is used to train, or teach, machine learning models to predict Lifetime Values (LTV) of properties, that is, how much revenue they will bring in over the course of the relationship.
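As a sketch of what such a table looks like in code — note that the column names and values below are hypothetical stand-ins, not Airbnb’s actual schema:

```python
import csv
import io

# Hypothetical sample: column names and values are illustrative only,
# not Airbnb's real schema or data.
raw = """host_city,avg_nightly_price,revenue_1yr
Lisbon,95.0,21000.0
Austin,140.0,33500.0
Kyoto,110.0,18700.0
"""

# Each row is one property; each column is one feature of that property.
rows = list(csv.DictReader(io.StringIO(raw)))

# A model predicting LTV would use columns like these as inputs (features)
# and a revenue column as the quantity to predict (the target).
features = [float(r["avg_nightly_price"]) for r in rows]
target = [float(r["revenue_1yr"]) for r in rows]
```

The row-per-example, column-per-feature shape is exactly what most machine learning libraries expect as input.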
Image data. Image data is data that consists of, well, images. Many of the successes of deep learning have occurred in the realm of image classification. The ability to diagnose disease from imaging data, such as identifying cancerous tissue in combined PET and CT scans, and the ability of self-driving cars to detect and classify objects in their field of vision are two of many use cases of image data. To work with image data, a data scientist will convert an image into a grid (or matrix) of numbers representing red-green-blue pixel values and use these matrices as inputs to their predictive models.
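A toy version of that conversion, using a made-up 2×2 image (real pipelines use libraries like NumPy or Pillow, but the idea is the same):

```python
# A toy 2x2 "image": each pixel is an (R, G, B) triple of integers in 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],     # top row: a red pixel, a green pixel
    [(0, 0, 255), (255, 255, 255)], # bottom row: a blue pixel, a white pixel
]

# Flatten the grid into one vector of numbers scaled to [0, 1] --
# the kind of numeric input a predictive model actually consumes.
pixels = [channel / 255.0
          for row in image
          for pixel in row
          for channel in pixel]
```

A 2×2 image with 3 color channels becomes a vector of 12 numbers; a photo from a phone camera becomes millions, but the principle is identical.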
Unstructured data. Unstructured data is, as one might guess, data that isn’t organized in either of the above manners. Part of the data scientist’s job is to structure such unstructured data so it may be analyzed. Natural language, or text, provides the clearest example. One common method of turning textual data into structured data is to represent it as word counts, so that “the cat chased the mouse” becomes “(cat,1),(chased,1),(mouse,1),(the,2)”. This is called a bag-of-words model, and allows us to compare texts, to compute distances between them, and to combine them into clusters. Bag-of-words performs surprisingly well for many practical applications, especially considering that it doesn’t distinguish “build bridges not walls” from “build walls not bridges.” Part of the game here is to turn textual data into numbers that we can feed into predictive models, and the principle is very similar between bag-of-words and more sophisticated methods. Such methods allow for sentiment analysis (“is a text positive, negative or neutral?”) and text classification (“is a given article news, entertainment or sport?”), among many others. For a recent example of text classification, check out Cloudera Fast Forward Labs’ prototype Newsie.
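The bag-of-words transformation described above fits in a few lines. This is a minimal sketch; real systems also strip punctuation, remove stop words, and so on:

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring order and case."""
    return Counter(text.lower().split())

counts = bag_of_words("the cat chased the mouse")
# counts maps "the" -> 2 and "cat", "chased", "mouse" -> 1 each.

# The model's blind spot: word order is discarded entirely, so these
# two opposite sentences produce identical representations.
same = bag_of_words("build bridges not walls") == bag_of_words("build walls not bridges")
```

Those counts are already numbers, so two texts can be compared directly — which is precisely how the distances and clusters mentioned above get computed.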
These are just two of the five steps to working with data, but they’re essential starting points for data literacy. When you’re dealing with data, think about how the data was collected and what kind of data it is. That will help you understand its meaning, how much to trust it, and how much work needs to be done to convert it into a useful form.