Concrete Problems in AI Safety

This article looks at the problem of accidents in machine learning systems, defined as unintended and harmful behaviour that emerges from poor design of real-world AI systems.

Such accidents tend to come up in the following circumstances:

  1. When we specify wrong objective functions.
  2. When we're not careful about the learning process.
  3. When we commit machine learning related implementation errors.

In this article we shall explore a research paper titled “Concrete Problems in AI Safety” by Dario Amodei and others. It has been a very influential paper, and it is worthwhile reading for anybody connected to Artificial Intelligence, which, more or less, we all are!

How much harm AI could potentially cause has always been a matter of debate. The paper emphasises that we do not need to invoke extreme scenarios to have productive discussions; such scenarios tend to lead to unnecessarily speculative discussions that lack precision.
It is more productive to base the discussion on general, practical issues with modern machine learning techniques.

Three trends motivate the paper:

  • Increasing promise of Reinforcement Learning
  • Trends towards more complex agents and environments. They could be a potential breeding ground for "side effects".
  • Trends towards autonomy in AI systems. Autonomy may cause little harm in applications like speech systems, but its side effects can be drastic in other industries.

AI Safety or mitigating accident risk

An accident can be described as a task producing harmful and unintended results. AI models are quite prone to such "accidents", and the paper elaborates on this with a lot of examples.

Wrong Formal Objective Function
The objective function can be thought of as the driving force of the model. If it is not defined with complete and proper thought, maximising it can lead to unforeseen harmful results. This shows up in the following categories.

  • Negative Side Effects - Eg. Disturbing the environment in a negative manner in the pursuit of its goals.
  • Reward Hacking - Simply put, it's finding a glitch in your reward. Eg. Hiding the stuff that is meant to be cleaned.

The problem with a clever "easy" solution is that it could pervert the spirit of the designer's intent.

Scalable Oversight
Even with the correct objective function, an incorrect method of evaluation can make the entire process backfire.
This often comes from bad extrapolation from limited samples, since the designer may have only limited access to the true objective function (for example, because it is too expensive to evaluate frequently).

Example: Throwing away unused goods can seem okay, but not every unused good is useless. How do we keep a check on that? How often should things be thrown away? The robot could ask before throwing something away, but would that become pestering?

Perfection and Exploration
Correct behaviour depends on correct beliefs. Problems like

  • Poorly curated training data
  • Insufficiently Expressive Model

may lead to negative and irrecoverable consequences that outweigh the long-term value of exploration.
We need "Safe exploration", as not every idea is going to work out. Trying to mop the floor with a damp cloth instead of a dry one might be okay, but trying the damp cloth on electrical equipment would be a blunder.

Sometimes the model might just silently make unpredictable bad decisions! Strategies learnt from the training data might not be appropriate for the test data.
Eg. A factory is quite different from a corporate office. We require Robustness to Distributional Shift.

Let's get into the details.

Avoiding Negative Side Effects

Negative side effects come up when the most effective way to achieve a goal is unrelated to, and moreover destructive to, the rest of the environment.

Consider a reinforcement learning agent tasked with moving a box from one corner of the room to another.
If a vase stands in the most efficient path, the agent might topple and break it. Focusing on one part of the environment may cause major disruptions across the rest of it.
Instead of "performing task X" we might like to look at it as "performing task X while taking the environment into account". To avoid the breakage we might want to penalise the agent for breaking the vase, but do we know what else could come along? It doesn't seem feasible to penalise every possible disruption individually. Maybe we need to rephrase the goal as "performing task X with respect to the environment while minimising the penalties".

We might penalise changes to the entire environment, but then the bot might just disable its vision so that it observes no changes. Natural changes in the system might also incur unnecessary penalties.
Let's settle on this: side effects come along with the tasks themselves, and one needs to design with this in mind.

It is easy to see that breaking the vase would be harmful in a lot of situations, so tackling the problem in general seems worthwhile. A successful approach may be transferable among tasks and counteract the general mechanisms that produce wrong results.

Our aim may be achieved in the following ways:

Define an Impact Regulariser
It seems natural to penalise "change in the environment". Having no impact at all is impossible and undesirable, so aiming for minimal impact alongside the task seems more promising.
The challenge here is to formalise "change in the environment".
A naive approach would be to penalise the distance between the current state and some initial state, but unfortunately this would lead the agent to avoid any change at all, and it would also penalise the natural evolution of the environment.

An alternative is to compare the future state under the robot's actions with a hypothetical future state in which the robot had acted passively. This leaves only the changes attributable to the robot. Predicting that hypothetical state is definitely not a straightforward job.

These approaches are very sensitive to the state representation and the distance metric used. For example, the choice of metric determines whether a spinning fan counts as a constant environment or a constantly changing one.
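To make the contrast between the two penalties concrete, here is a minimal Python sketch. The state features (`box_at_goal`, `vase_intact`, `fan_angle`), the L1 distance, and all the numbers are illustrative assumptions made for this article, not something defined in the paper.

```python
from typing import Dict

State = Dict[str, float]


def distance(a: State, b: State) -> float:
    """L1 distance between two states over the union of their features."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


def naive_penalty(initial: State, after_agent: State) -> float:
    """Penalise any difference from the initial state, whatever caused it."""
    return distance(initial, after_agent)


def baseline_penalty(after_noop: State, after_agent: State) -> float:
    """Penalise only the difference from what would have happened anyway."""
    return distance(after_noop, after_agent)


# Toy world: a fan turns on its own, so the state drifts even if the robot idles.
initial = {"box_at_goal": 0.0, "vase_intact": 1.0, "fan_angle": 0.0}
after_noop = {"box_at_goal": 0.0, "vase_intact": 1.0, "fan_angle": 0.5}
careful = {"box_at_goal": 1.0, "vase_intact": 1.0, "fan_angle": 0.5}
careless = {"box_at_goal": 1.0, "vase_intact": 0.0, "fan_angle": 0.5}

print("naive penalty for doing nothing:", naive_penalty(initial, after_noop))
print("baseline penalty for doing nothing:", baseline_penalty(after_noop, after_noop))
print("baseline penalty, careful plan:", baseline_penalty(after_noop, careful))
print("baseline penalty, careless plan:", baseline_penalty(after_noop, careless))
```

With these toy numbers the naive penalty charges the robot for the fan turning on its own, while the baseline comparison does not and still separates the careless plan from the careful one. Moving the box is itself counted as impact, which is why the aim is minimal rather than zero impact.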

Train an Impact Regularizer
Training an impact regulariser over many tasks might be more flexible than defining one by hand. Of course, transfer learning might work well here, but we need to remember that a cleaning bot working in an office faces a very different scenario from a cleaning bot in a factory.
Separating the side effect component from the task component, instead of applying transfer learning to the task directly, can really speed up transfer learning and improve its effectiveness.

Penalise Influence
We've mostly discussed this at length. We do not want the agent to do things that have side effects! We would also prefer not to put our bot in a position where it could easily do something that brings a side effect.
Example: don't bring a jug full of water into a room full of important documents.
There exist several information-theoretic measures that attempt to capture an agent’s potential for influence over its environment, and these are often used as intrinsic rewards. Perhaps the best-known such measure is empowerment.
Empowerment is usually maximized (rather than minimized) to provide intrinsic reward. This can cause the agent to display interesting behavior without any external rewards, such as avoiding walls.

Take humans as an example: when deserving candidates are empowered, they bring greater profits because they feel a sense of ownership, but empowerment in the wrong hands can bring bigger threats, so we take it away from those people. The idea here would likewise be to penalize (minimize) empowerment as a regularization term, in an attempt to reduce potential impact. This idea as written may not quite work, because empowerment measures precision of control over the environment rather than total impact.
Furthermore, naively penalizing empowerment may create perverse incentives, such as breaking a vase in order to remove the option to break it in the future!
Despite these issues, empowerment does show that simple measures are capable of capturing very general notions of influence on the environment.
The paper regards this as a potential challenge for future research.

Multi Agent Approaches
Avoiding side effects can be seen as a proxy for the thing we really care about: avoiding negative externalities. If a side effect is acceptable to everyone, there’s no need to avoid it! What we want to do is understand all other agents (including humans) and make sure that our actions don’t harm their interests.

One approach to this is Cooperative Inverse Reinforcement Learning, where an agent and a human work together to achieve the human’s goals. This concept can be applied to situations where we want to make sure that an agent does not prevent a human from shutting it down if it displays undesired behavior.

Another idea might be a reward autoencoder, which tries to encourage a kind of “goal transparency” where an external observer can easily infer what the agent is trying to do.
In particular, the agent’s actions may be treated as an encoding of its reward function, and we can apply standard autoencoding techniques to ensure that this encoding can be decoded accurately.
Actions that have lots of side effects might be more difficult to decode uniquely to their original goal, which creates an implicit pressure to avoid them.

Reward Uncertainty
We wish to avoid unanticipated side effects because the environment is already more or less arranged according to our preferences, so a random change is more likely to be terrible than brilliant. Rather than giving the agent a single reward function, we could make it uncertain about the reward function, with a prior probability distribution that reflects the property that random changes are more likely to be bad than good. This could incentivise the agent to avoid having a large effect on the environment.

Good approaches to side effects certainly cannot replace extensive testing or careful consideration by designers. However, these approaches might help to deal with the general tendency for harmful side effects to proliferate in complex environments.

Avoiding Reward Hacking

Remember how we talked about the bot disabling its vision so that it sees no changes? That is precisely reward hacking: twisting the reward system to your advantage. Another example would be creating a mess in order to clean it up and get rewarded! Such "reward hacks", for obvious reasons, can have harmful impacts in the real environment. Sometimes our objective functions can be “gamed” by solutions that may be valid in a literal sense but don’t meet the designer’s intent. Reward hacking may be a deep problem, and one that is likely to become more common as agents and environments increase in complexity.

It may occur in the following ways:

Partially Observed Goals
In the real world, tasks involve bringing the external world into some objective state, which the agent can only ever confirm through imperfect perceptions.
Here we lack a perfect measure of task performance, so designers may have to design rewards that represent only a partial or imperfect measure. For example, the robot might be rewarded based on how few messes it sees, but this can be hacked: the robot may think the office is clean if it simply closes its eyes.
While it can be shown that there always exists a reward function in terms of actions and observations that is equivalent to optimizing the true objective function, often this reward function involves complicated long-term dependencies and is prohibitively hard to use in practice.

Complicated Systems
Any "powerful" agent will be a complicated system. Just as the probability of bugs in computer code increases with the complexity of the program, the probability that there is a viable hack affecting the reward function increases with the complexity of the agent.

Abstract Rewards
Sophisticated reward functions might refer to abstract concepts, such as assessing whether a conceptual goal has been met. Such concepts are vulnerable to exploitation unless perfectly specified. They will possibly need to be learned by models like neural networks, which can themselves be vulnerable to adversarial counterexamples.

Goodhart’s Law
Reward hacking can occur if a designer chooses an objective function that appears highly correlated with the task, but that correlation breaks down with strong optimization.
For example, under normal circumstances a cleaning robot’s success in cleaning up the office is proportional to the rate at which it consumes cleaning supplies, such as bleach.
However, if we base the robot’s reward on this measure, it might use more bleach than required, or simply pour bleach down the drain in order to give the appearance of success.
In the economics literature this is known as Goodhart’s law: “when a metric is used as a target, it ceases to be a good metric.”
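The bleach example can be written down as a tiny Python sketch. The three behaviours, the litre amounts, and the saturating `true_cleanliness` function are all assumptions made up for illustration; the only point is that the behaviour that maximises the proxy is not the one that maximises the true objective.

```python
from typing import Dict, Tuple

# Each hypothetical behaviour is (litres of bleach used on the floor,
# litres of bleach poured down the drain).
BEHAVIOURS: Dict[str, Tuple[float, float]] = {
    "clean_normally": (1.0, 0.0),
    "overuse_bleach": (3.0, 0.0),
    "pour_down_drain": (0.0, 10.0),
}


def proxy_reward(on_floor: float, down_drain: float) -> float:
    """Proxy metric: total bleach consumed, regardless of where it went."""
    return on_floor + down_drain


def true_cleanliness(on_floor: float, down_drain: float) -> float:
    """Assumed true objective: cleanliness saturates once the floor is clean."""
    return min(on_floor, 1.0)


# An optimiser that only sees the proxy picks the highest-consumption behaviour.
best_by_proxy = max(BEHAVIOURS, key=lambda name: proxy_reward(*BEHAVIOURS[name]))
best_by_truth = max(BEHAVIOURS, key=lambda name: true_cleanliness(*BEHAVIOURS[name]))

print("best by proxy:", best_by_proxy)
print("its true cleanliness:", true_cleanliness(*BEHAVIOURS[best_by_proxy]))
print("best by true objective:", best_by_truth)
```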

Feedback Loops
Sometimes an objective function has a component that can reinforce itself, eventually getting amplified to the point where it drowns out or severely distorts what the designer intended the objective function to represent.

For instance, an ad placement algorithm that displays more popular ads in a larger font will tend to further enhance the popularity of those ads (since they will be shown more prominently), leading to a positive feedback loop where ads that saw a small transient increase in popularity are rocketed to permanent dominance.
Here the original intent of the objective function gets overpowered by the positive feedback inherent in the deployment strategy.
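A small simulation shows how quickly such a loop can run away. The update rule below (prominence proportional to the square of the current click share, and clicks proportional to prominence) is an assumption chosen for illustration, not a model of any real ad system.

```python
from typing import List


def update_shares(shares: List[float]) -> List[float]:
    """One round: prominence grows faster than popularity, clicks follow prominence."""
    weights = [s ** 2 for s in shares]
    total = sum(weights)
    return [w / total for w in weights]


def simulate(shares: List[float], rounds: int) -> List[List[float]]:
    """Return the click shares after each round, starting with the initial shares."""
    history = [list(shares)]
    for _ in range(rounds):
        shares = update_shares(shares)
        history.append(shares)
    return history


# Two ads of equal quality; one happens to start with a tiny lead.
history = simulate([0.51, 0.49], rounds=10)
for round_number, shares in enumerate(history):
    print(round_number, [round(s, 4) for s in shares])
```

Under this assumed rule, the ad that started with a two-point lead ends up with essentially all of the clicks within ten rounds, even though nothing about the ads themselves differs.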

Environmental Embedding
In the formalism of reinforcement learning, rewards come from the environment. But even when the reward is an abstract idea like the score in a board game, it must be computed somewhere, such as in a sensor or a set of transistors. A sufficiently capable agent could in principle tamper with that implementation, assigning itself high reward “by fiat.”
For example, a board-game playing agent could tamper with the sensor that counts the score. Effectively, this means that we cannot build a perfectly faithful implementation of an abstract objective function, because there are always certain sequences of actions for which the objective function may be physically replaced. This mode of failure is known as wireheading.

It also seems like a difficult form of reward hacking to avoid.
In today’s relatively simple systems these problems may not occur, or can be rectified without much harm as part of the development process. However, the problem may become more severe with more complicated reward functions and agents that act over longer timescales.

Finally, once an agent begins hacking its reward function and finds easier ways to get high reward, it won’t be inclined to stop, which could lead to additional challenges.

Below are some machine-learning based approaches to preventing reward hacking:

Adversarial Reward Functions
The problem is that the system has an adversarial relationship with its reward function: it would like to find any way it can to get high reward, whether or not its behavior satisfies the intent of the reward specifier.
The machine learning system is a potentially powerful agent, while the reward function is a static object that cannot respond to the system’s attempts to game it. If instead the reward function were itself an agent and could take actions to explore the environment, it might be harder to fool.
For instance, the reward agent could try to find scenarios that appear high reward but that a human labels as low reward. Here, we need to ensure that the reward-checking agent is stronger than the agent that is trying to achieve rewards.

Model Lookahead
In model-based RL, the agent plans its future actions by using a model to work out which future states a sequence of actions may lead to. In some setups, we could give reward based on anticipated future states, rather than the present one.
This could be very helpful against situations where the model may overwrite its reward function: you can penalise the agent for planning to replace the reward function. This is very similar to how a person would probably “enjoy” taking addictive substances once they do, but would not want to become an addict.

Careful Engineering
Some kinds of reward hacking might be avoided by very careful engineering. In particular, formal verification or practical testing of parts of the system is likely to be valuable. Isolating the agent from its reward signal (for example, through sandboxing) could also be useful, though we can't expect this to catch every possible mishap.

Reward Capping
In some cases, putting a cap on the maximum possible reward may provide an effective solution. However, while capping can prevent extreme low-probability, high-payoff strategies, it can’t prevent strategies like the cleaning robot closing its eyes to avoid seeing dirt.
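Both halves of this point fit in a few lines of Python. The strategy names and raw reward values are hypothetical; the cap of 10.0 is assumed to be the score of a perfectly clean office.

```python
def capped_reward(raw_reward: float, cap: float) -> float:
    """Clip the reward from above so no single strategy can pay off without bound."""
    return min(raw_reward, cap)


CAP = 10.0

# Hypothetical raw rewards for three strategies.
raw = {
    "clean_office": 8.0,          # intended behaviour
    "exploit_scoring_bug": 1e6,   # extreme payoff from a glitch
    "close_eyes": 10.0,           # sees no dirt, so scores as "perfectly clean"
}

capped = {name: capped_reward(value, CAP) for name, value in raw.items()}
print(capped)
```

The cap removes the million-point payoff, but the eyes-closed strategy never exceeded the cap in the first place and still beats honest cleaning.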

Counterexample Resistance
If we are worried that learned components of our systems might be vulnerable to adversarial counterexamples, we can look to existing research on resisting them, such as adversarial training. Architectural decisions and weight uncertainty might also help, although all of this can only address a subset of these potential problems.

Multiple Rewards
A combination of multiple rewards may be more difficult to hack and more robust. We could combine reward functions by averaging, taking the minimum, taking quantiles, or something else entirely. Of course, there may still be bad behaviors which affect all the reward functions in a correlated manner.
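As a rough sketch of the idea, the snippet below combines three reward signals for a cleaning robot. The signals (`camera_reward`, `dust_sensor_reward`, `human_rating_reward`), the observation fields, and the numbers are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List

Observation = Dict[str, float]


# Three hypothetical reward signals, each scaled to the range [0, 1].
def camera_reward(obs: Observation) -> float:
    return 1.0 - obs["visible_dirt"]


def dust_sensor_reward(obs: Observation) -> float:
    return 1.0 - obs["dust_level"]


def human_rating_reward(obs: Observation) -> float:
    return obs["human_rating"]


REWARDS: List[Callable[[Observation], float]] = [
    camera_reward,
    dust_sensor_reward,
    human_rating_reward,
]


def combined_reward(obs: Observation, how: str = "min") -> float:
    """Combine the individual signals by taking their minimum or their mean."""
    values = [reward(obs) for reward in REWARDS]
    if how == "min":
        return min(values)
    if how == "mean":
        return mean(values)
    raise ValueError(f"unknown combination rule: {how}")


honest = {"visible_dirt": 0.1, "dust_level": 0.1, "human_rating": 0.9}
# Dirt swept under the rug: invisible to the camera, but still dusty.
dirt_hidden = {"visible_dirt": 0.0, "dust_level": 0.8, "human_rating": 0.3}

print("camera only:", camera_reward(honest), camera_reward(dirt_hidden))
print("min of all:", combined_reward(honest), combined_reward(dirt_hidden))
```

Hiding the dirt fools the camera alone, but it lowers the combined reward because the other two signals are not fooled. A behaviour that fooled all three at once would still get through.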

Reward Pretraining
A possible defense against an agent influencing its own reward function is to train the reward function ahead of time as a supervised learning process devoid of any environmental interaction.
This could involve either learning the reward function from samples of state-reward pairs, or from trajectories, as in inverse reinforcement learning. However, this forfeits the ability to further learn the reward function after pretraining has been completed, which may create other vulnerabilities.

Variable Indifference
Often we want an agent to optimize certain variables in the environment, without trying to optimize others. For example, we want an agent to maximize reward, without optimizing what the reward function is or trying to manipulate human behavior.
If this could be solved in general, it would have applications throughout safety, and it seems connected to avoiding side effects. Of course, a challenge here is to make sure the variables targeted for indifference are actually the variables we care about in the world, and not partially observed versions of them.

Trip Wires
If an agent is going to try to hack its reward function, it is preferable that we know about it. We could deliberately introduce some plausible vulnerabilities and monitor them, alerting us and stopping the agent immediately if it attempts to take advantage of one. Such “trip wires” don’t solve reward hacking, but they might reduce the risk or at least provide diagnostics.
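A trip wire can be sketched as a monitor that sits between the agent and the environment. The class name, the action names, and the idea of representing actions as strings are assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple


class TripWireMonitor:
    """Watches for deliberately planted 'honeypot' actions and records attempts."""

    def __init__(self, honeypot_actions: Iterable[str]) -> None:
        self.honeypot_actions: Set[str] = set(honeypot_actions)
        self.alerts: List[str] = []

    def is_tripped(self, action: str) -> bool:
        """Return True (and log an alert) if the action touches a planted vulnerability."""
        if action in self.honeypot_actions:
            self.alerts.append(action)
            return True
        return False


def run_episode(actions: Iterable[str], monitor: TripWireMonitor) -> Tuple[List[str], bool]:
    """Execute actions in order, halting before any action that trips the wire."""
    executed: List[str] = []
    for action in actions:
        if monitor.is_tripped(action):
            return executed, True
        executed.append(action)
    return executed, False


# The planted vulnerability: a (hypothetical) writable reward register.
monitor = TripWireMonitor({"write_reward_register"})
executed, halted = run_episode(["vacuum", "write_reward_register", "mop"], monitor)
print("executed:", executed)
print("halted:", halted)
print("alerts:", monitor.alerts)
```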

Scalable Oversight

Consider an autonomous agent that needs to perform some complex task, such as cleaning the office. We'd like the agent to maximize a complex objective, such as how happy the user would be if they inspected the result in detail. We might not have enough time to provide such oversight for every training example, so in order to train the agent we have to depend on cheaper approximations that can be evaluated efficiently during training but do not perfectly track what we care about. This divergence worsens problems like unintended side effects (which the complex objective would penalize but the cheap approximation may omit) and reward hacking (which thorough oversight would recognize as undesirable). To succeed, we need efficient ways to exploit our limited oversight budget, for example by combining limited calls to the true objective function with frequent calls to an imperfect proxy that we are given or can learn.
One framework for thinking about this problem is semi-supervised reinforcement learning, which resembles ordinary reinforcement learning except that the agent can only see its reward on a small fraction of the timesteps or episodes. The agent’s performance is still evaluated on the basis of reward from all episodes, but it must optimize this on the basis of the limited reward samples it sees.

The active learning setting seems the most interesting: here the agent can request to see the reward on whichever episodes or timesteps would be most useful for learning. The goal is to be economical with

  1. feedback requests
  2. training time

An important subtask of semi-supervised RL is to identify proxies which can predict the reward, and to learn the conditions under which they are valid.
It may incentivize communication and transparency by the agent, since the agent will want as much cheap proxy feedback as possible about whether its decisions will ultimately receive high reward.
For example, hiding dirt under the rug simply breaks the correspondence between the user’s reaction and the true reward signal, and so will be avoided.

Following are some approaches to it:

Supervised Reward Learning
Train a model to predict the reward from the state on either a per-timestep or a per-episode basis, and use it to estimate the payoff of unlabelled episodes, with an appropriate weighting or uncertainty estimate to account for the lower confidence in estimated rewards compared with known rewards.
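Here is a minimal sketch of that recipe using a one-feature linear reward model fitted by least squares. The feature (messes cleaned per episode), the data, and the down-weighting factor of 0.5 for predicted rewards are all assumptions chosen to keep the example small.

```python
from typing import List, Tuple


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def predict(model: Tuple[float, float], x: float) -> float:
    slope, intercept = model
    return slope * x + intercept


def estimated_mean_return(
    labelled: List[float], predicted: List[float], predicted_weight: float
) -> float:
    """Weighted average in which predicted rewards count for less than observed ones."""
    total = sum(labelled) + predicted_weight * sum(predicted)
    weight = len(labelled) + predicted_weight * len(predicted)
    return total / weight


# Labelled episodes: (messes cleaned, reward the supervisor actually gave).
labelled_features = [1.0, 2.0, 3.0]
labelled_rewards = [3.0, 5.0, 7.0]
# Unlabelled episodes: we only know the feature, not the reward.
unlabelled_features = [4.0, 6.0]

model = fit_linear(labelled_features, labelled_rewards)
predicted_rewards = [predict(model, x) for x in unlabelled_features]
estimate = estimated_mean_return(labelled_rewards, predicted_rewards, predicted_weight=0.5)

print("model (slope, intercept):", model)
print("predicted rewards:", predicted_rewards)
print("estimated mean return:", estimate)
```

A fixed weight is the crudest possible way to express lower confidence; a real system would more likely derive the weighting from an uncertainty estimate of the reward model.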

Semi-supervised or Active Reward Learning
Combining the above with traditional semi-supervised or active learning can accelerate the learning of the reward estimator. For example, the agent could learn to identify “salient” events within the environment, and request to see the reward that is associated with these events.

Unsupervised Model Learning
If using model-based Reinforcement Learning, use the observed transitions of the unlabeled episodes to improve the model's quality. As an example, a semi-supervised RL agent should be able to learn to play games using a small number of direct reward signals, relying almost entirely on the visual display of the score.

This simple example can be used to capture other safety issues:

  1. The agent might be able to modify the score being displayed without modifying the actual score.
  2. The agent might require some special action (like pausing the game) to see its score.
  3. The agent might learn a sequence of approximations (like learning that certain sounds are associated with rewards and other sounds with penalties).
  4. The agent might be able to learn to play from a handful of explicit reward requests, such as “how many points did I get on the frame where that enemy ship blew up? How about the bigger enemy ship?”

There are other possible approaches to scalable oversight:

Distant supervision
Rather than providing evaluations of some small fraction of the system’s decisions, we could provide some useful information about the system’s decisions in aggregate, or some noisy hints about the correct evaluations.

Hierarchical reinforcement learning
Hierarchical reinforcement learning offers another approach to scalable oversight. Here a top-level agent takes a relatively small number of highly abstract actions that extend over large temporal or spatial scales, and receives rewards over similar timescales.
The agent completes actions by delegating them to sub-agents, which it incentivizes with a synthetic reward signal that represents correct completion of the action, and which may themselves delegate to sub-sub-agents. At the lowest level, agents directly take primitive actions within the environment.
The top-level agent in hierarchical RL may be able to learn from very sparse rewards, since it does not need to learn how to implement the finer details of its policy; meanwhile, the sub-agents will receive a dense reward signal even if the top-level reward is very sparse, since they are optimizing synthetic reward signals defined by higher-level agents.
So a successful approach to hierarchical RL might naturally facilitate scalable oversight, which makes it seem particularly promising.

Safe Exploration

All autonomous learning agents need to engage in exploration at least sometimes, that is, to take actions that do not seem ideal given current information but which help the agent learn about its environment.
However, exploration can be dangerous, since it involves taking actions whose consequences the agent doesn’t understand well. The real world can be much less forgiving than a simulated game. Badly chosen actions may destroy the agent itself or trap it in states it cannot get out of.
Yet intuitively it seems like it should be possible to predict which actions are dangerous and explore in a way that avoids them, even when we don’t have that much information about the environment.
For example, to learn about lions, should I buy a lion, or buy a book about lions? It takes only a slight bit of prior knowledge about lions to determine which option is safer.
In practice, real-world projects often avoid these issues by simply hard-coding an avoidance of catastrophic behaviours.
This works well when only a small number of known things could go wrong. But as autonomy and complexity increase, it becomes more and more difficult to anticipate every possible catastrophic failure.
Hard-coding against every possible malfunction is unlikely to be feasible in general, so a principled approach to preventing harmful exploration seems necessary. Even in simple cases, a principled approach would simplify things.
Following are some general routes:

Risk-Sensitive Performance Criteria
Consider changing the optimization criterion from expected total reward to other objectives that focus on preventing rare, catastrophic events. These approaches involve optimizing worst-case performance, ensuring that the probability of very bad performance is small, or penalizing the variance in performance.
These methods have not yet been tested with expressive function approximators such as deep neural networks, but this should be possible in principle for some of them, for example those that propose modifications to policy gradient algorithms to optimize a risk-sensitive criterion. There is also recent work on estimating uncertainty in value functions that are represented by deep neural networks; such ideas could be incorporated into risk-sensitive RL algorithms.
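The difference between these criteria shows up even on a handful of sampled returns. In the sketch below, the two policies, their return samples, the risk weight of 0.1, and the "very bad" threshold of 0 are invented for illustration.

```python
from statistics import fmean, pvariance
from typing import Callable, Dict, List

Returns = List[float]


def expected_return(returns: Returns) -> float:
    """The usual RL criterion: average return."""
    return fmean(returns)


def worst_case(returns: Returns) -> float:
    """Worst observed return."""
    return min(returns)


def variance_penalised(returns: Returns, risk_weight: float) -> float:
    """Average return minus a penalty proportional to the variance."""
    return fmean(returns) - risk_weight * pvariance(returns)


def prob_below(returns: Returns, threshold: float) -> float:
    """Empirical probability of a return below the threshold."""
    return sum(r < threshold for r in returns) / len(returns)


# Hypothetical sampled returns for two policies.
SAMPLES: Dict[str, Returns] = {
    "cautious": [4.0, 5.0, 5.0, 6.0],
    "reckless": [12.0, 12.0, 12.0, -10.0],
}


def best(criterion: Callable[[Returns], float]) -> str:
    """Name of the policy that scores highest under the given criterion."""
    return max(SAMPLES, key=lambda name: criterion(SAMPLES[name]))


print("expected return picks:", best(expected_return))
print("worst case picks:", best(worst_case))
print("variance-penalised picks:", best(lambda r: variance_penalised(r, 0.1)))
print("P(return < 0), reckless:", prob_below(SAMPLES["reckless"], 0.0))
```

Only the plain expected-return criterion prefers the reckless policy; each of the risk-sensitive criteria prefers the cautious one on this data.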

Use Demonstrations
Exploration is necessary to ensure that the agent finds the states needed for near-optimal performance. We may be able to avoid the need for exploration altogether if we instead use inverse RL or apprenticeship learning, where the learning algorithm is provided with expert trajectories of near-optimal behavior. Recent progress in inverse reinforcement learning using deep neural networks suggests that it might also be possible to reduce the need for exploration in advanced RL systems by training on a small set of demonstrations. Such demonstrations could be used to create a baseline policy, so that even if further learning is necessary, exploration away from the baseline policy can be limited in magnitude.

Simulated Exploration
We can also explore in simulated environments instead of the real one, where there is less opportunity for something detrimental to happen. It will always be necessary to do some exploration in the real world, since many complex situations cannot be captured perfectly by a simulator, but it could be possible to learn about the danger in simulation and then adopt a more conservative “safe exploration” policy when exploring in the real world.

Bounded Exploration
If we know that a certain portion of the state space is safe, such that even the worst possible action in it is recoverable, we can allow the agent to run freely within those bounds. For instance, a quadcopter sufficiently far from the ground can explore safely, since even if something goes wrong there will be enough time for a human or another policy to intervene. A related approach is Trusted Policy Oversight: given a trusted policy and a model of the environment, we can limit exploration to actions the trusted policy believes we can recover from. It’s alright to dive towards the ground, as long as we can pull out in time. Human Oversight can also help, by checking potentially unsafe actions with a human before taking them.
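The quadcopter case can be sketched as a simple safety filter around an exploratory action. The one-step altitude model, the 5-metre safety floor, and the fallback "climb" action are assumptions made for the example; a real system would need a far richer notion of recoverability.

```python
MIN_SAFE_ALTITUDE = 5.0  # metres; assumed floor above which recovery is always possible


def predicted_altitude(altitude: float, climb: float) -> float:
    """Assumed one-step model: the action directly changes altitude."""
    return altitude + climb


def is_recoverable(altitude: float, climb: float) -> bool:
    """An action is allowed only if it keeps the craft inside the safe region."""
    return predicted_altitude(altitude, climb) >= MIN_SAFE_ALTITUDE


def choose_action(altitude: float, exploratory_climb: float, fallback_climb: float) -> float:
    """Take the exploratory action when it is recoverable, else the trusted fallback."""
    if is_recoverable(altitude, exploratory_climb):
        return exploratory_climb
    return fallback_climb


# High up, a 3 m dive is fine; close to the floor, the fallback (climb 1 m) is used.
print(choose_action(altitude=10.0, exploratory_climb=-3.0, fallback_climb=1.0))
print(choose_action(altitude=6.0, exploratory_climb=-3.0, fallback_climb=1.0))
```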

Robustness to Distributional Change

All of us occasionally find ourselves in situations that our previous experience has not adequately prepared us to deal with—for instance, travelling to a country whose culture is very different from ours.
Such situations are naturally difficult, and some missteps are inevitable. However, a key (and often rare) skill in dealing with such situations is to recognize our own ignorance, rather than simply assuming that the intuitions we developed elsewhere will carry over perfectly.
Machine learning systems have this problem too: a speech system trained on clean speech will perform terribly on noisy speech, yet often be highly confident in its erroneous results. In the case of our cleaning robot, an office might contain pets that the robot, never having seen them before, attempts to wash with soap, leading to predictably bad results.

In general, when the test distribution differs from the training distribution, machine learning systems may not only perform worse, but also wrongly assume that their performance is good.
Such errors can be harmful or offensive. For example, a classifier could give a wrong medical diagnosis with such high confidence that the data isn’t flagged for human inspection.
As the autonomy of such agents increases, there may be even greater potential for something terrible to occur.
Additionally, safety checks that depend on trained machine learning systems may fail silently and unpredictably for similar reasons.

Following are some approaches to the problem. You are encouraged to go through the paper and its references to gain a clear understanding of them.

  • Well-specified models
    • covariate shift
    • marginal likelihood
  • Partially specified models
    • method of moments
    • unsupervised risk estimation
    • causal identification
    • limited-information maximum likelihood.
  • Training on multiple distributions.
  • How to respond when out-of-distribution
  • A unifying view
    • counterfactual reasoning
    • machine learning with contracts.

There are many approaches to building ML systems that work well on novel test distributions.
One family of approaches is based on the assumption of a well-specified model. In this case, the main hindrances are:

  • It is difficult to build well-specified models.
  • Maintaining uncertainty on novel distributions is hard with finite training data.
  • Detecting when a model is mis-specified is also difficult.

Another family of approaches is based on a partially specified model. This approach is potentially promising, but it currently suffers from the following hindrances:

  • There has been little development in the context of machine learning, since most of the historical development has come from the field of econometrics.
  • It is an open question whether partially specified models are fundamentally constrained to simple situations and/or conservative predictions, or whether they can meaningfully scale to the complex situations demanded by modern applications.

Finally, one could train on a variety of training distributions hoping that a model which works well on many training distributions simultaneously will also work well on a novel test distribution.

Related Efforts

Several other communities have also thought broadly about the safety of AI systems, both within and outside of the machine learning community. Below is the work from communities other than the machine learning community.

Cyber-Physical Systems Community
This community studies the security and safety of systems that interact with the physical world. Illustrative of this work is an impressive and successful effort to formally verify the entire federal aircraft collision avoidance system. Similar work covers traffic control algorithms and many other topics.

Futurist Community
A cross-disciplinary group of academics and non-profits has raised concern about the long term implications of AI, particularly superintelligent AI.

  • The Future of Humanity Institute has studied this issue particularly as it relates to future Artificial Intelligence systems learning or executing humanity’s preferences.
  • The Machine Intelligence Research Institute has studied safety issues that may arise in very advanced AI, including a few mentioned above, albeit at a more philosophical level.

Other Calls for Work on Safety
There exist other public documents within the research community throwing light on the importance of work on AI safety.
A 2015 Open Letter signed by many members of the research community states the importance of “how to reap AI’s benefits while avoiding the potential pitfalls.”
They propose research priorities for robust and beneficial artificial intelligence, and include several other topics in addition to a discussion of AI-related accidents.

Related Problems in Safety
The social impacts of AI technologies have also been discussed by a number of researchers. Aside from work directly on accidents (reviewed above), there is also substantial work on other topics that are closely related to, or overlap with, the issue of accidents, such as Privacy, Fairness, Security, Abuse, Transparency and Policy.


This paper dealt with the problem of accidents in machine learning systems, focusing primarily on reinforcement learning agents. It presented five possible research problems, and for each it discussed possible approaches that are highly amenable to concrete experimental work.
With the realistic possibility of ML-based systems controlling industrial processes, health-related systems, and other mission-critical technology, even small-scale accidents seem like a very genuine threat, and are important to prevent. The risk of larger accidents is more difficult to estimate, but it is worthwhile to develop a principled and forward-looking approach to safety that continues to remain relevant as autonomous systems gain more power.

And that was all! It was indeed a pretty long article, but if you made it here you must've gathered greater knowledge and a deeper understanding of how things work and how they should work in the world of AI. Reading the entire research paper is highly recommended!

All the best for your AI journey!
