COAT: Discovery of the Hidden World with Large Language Models

1Hong Kong Baptist University, 2MBZUAI, 3Carnegie Mellon University, 4The Chinese University of Hong Kong,
5The University of Sydney, 6The University of Melbourne
(*Equal Contribution)
Accepted at NeurIPS 2024

Abstract

Science originates from discovering new causal knowledge out of known facts and observations. Traditional causal discovery approaches mainly rely on high-quality measured variables, usually provided by human experts, to find causal relations. However, such causal variables are often unavailable in a wide range of real-world applications. The rise of large language models (LLMs), which are trained to learn rich knowledge from massive observations of the world, provides a new opportunity to assist with discovering high-level hidden variables from raw observational data. Therefore, we introduce COAT: Causal representatiOn AssistanT. COAT incorporates an LLM as a factor proposer that extracts potential causal factors from unstructured data. The LLM can also be instructed to provide additional information used to collect data values (e.g., annotation criteria) and to parse the raw unstructured data into structured data. The annotated data are then fed to a causal learning module (e.g., the FCI algorithm) that provides both rigorous explanations of the data and useful feedback to further improve the extraction of causal factors by the LLM. We verify the effectiveness of COAT in uncovering the underlying causal system with two case studies of review rating analysis and neuropathic diagnosis.

COAT Framework


Figure 1. Illustration of the COAT framework.

Inspired by real-world causal discovery applications, given a new task with unstructured observational data, COAT aims to uncover the Markov blanket of a target variable (a minimal code sketch of the loop follows the list):
  • (a) Factor Proposal. COAT first adopts an LLM to read, comprehend, and relate the rich knowledge acquired during pre-training to propose a series of candidate factors along with meta-information such as annotation guidelines.
  • (b) Factor Annotation. Based on the candidate factors, COAT then prompts another LLM to annotate the unstructured data, i.e., to fetch structured values for each factor.
  • (c1) Causal Discovery. With the annotated structured data, a causal discovery algorithm is called to find the causal relations among the factors.
  • (c2) Feedback Construction. By inspecting samples where the target variable cannot be well explained by the existing factors, the LLM is expected to associate more related knowledge and uncover the remaining causal factors.
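Conceptually, one round of COAT can be written as a short loop. Below is a minimal sketch, assuming hypothetical helpers propose_factors, annotate, and build_feedback that stand in for the LLM prompting described above; only the FCI call, from the causal-learn package, is a concrete dependency.

import numpy as np
from causallearn.search.ConstraintBased.FCI import fci

# --- hypothetical stand-ins for the LLM-driven steps (prompting omitted) ---
def propose_factors(llm, texts, known_factors, feedback):
    """Ask the LLM for new candidate factors plus annotation guidelines."""
    raise NotImplementedError

def annotate(llm, text, factors):
    """Return one structured row: one value per factor, target last."""
    raise NotImplementedError

def build_feedback(data, graph, texts):
    """Gather samples that the current factors fail to explain."""
    raise NotImplementedError

def coat_loop(raw_texts, llm, max_rounds=5):
    factors, feedback, graph = [], None, None
    for _ in range(max_rounds):
        # (a) Factor Proposal, conditioned on the feedback from last round.
        new_factors = propose_factors(llm, raw_texts, factors, feedback)
        if not new_factors:                # no new factor: loop has converged
            break
        factors += new_factors
        # (b) Factor Annotation: unstructured text -> structured table.
        data = np.array([annotate(llm, t, factors) for t in raw_texts])
        # (c1) Causal Discovery: FCI over the annotated factors.
        graph, _ = fci(data)
        # (c2) Feedback Construction: hard samples are handed back to (a).
        feedback = build_feedback(data, graph, raw_texts)
    return factors, graph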

Results on AppleGastronome Benchmark

In the AppleGastronome benchmark, the target variable Y is the rating score of an apple given by gastronomes. We prepare different high-level factors: three parents of Y, one child of Y, and one spouse of Y; these five factors form the Markov blanket of Y. In addition, we prepare one disturbing factor that is related to Y but is not part of the blanket. A good method is expected to propose the five high-level factors (up to semantic equivalence) and to exclude the disturbing factor.
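For intuition, this ground-truth structure can be rendered as a toy structural causal model; the linear Gaussian mechanisms and variable names below are our simplifying assumptions for illustration, not the benchmark's actual generating process.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
P1, P2, P3 = (rng.normal(size=n) for _ in range(3))  # three parents of Y
S = rng.normal(size=n)                                # spouse of Y
Y = P1 + P2 + P3 + rng.normal(size=n)                 # target rating score
C = Y + S + rng.normal(size=n)                        # common child of Y and S
D = P1 + rng.normal(size=n)                           # disturbing factor

# Markov blanket of Y: {P1, P2, P3, C, S}. D correlates with Y through P1
# but carries no extra information about Y once the blanket is known.
print(np.corrcoef(D, Y)[0, 1])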


Box 1. Examples of AppleGastronome data, grouped by the value of scores.


Main Result. Empirically, LLMs with CoT can be aware of the high-level factors behind the data (lower OT than META) but still struggle to distinguish the desired factors in the Markov blanket (higher NMB than COAT). COAT is more resistant to the "disturbing" factor, as supported by its lower NMB column. COAT filters out irrelevant factors from the LLMs' prior knowledge that are not reflected by the data, as supported by its lower OT column. COAT also robustly encourages the LLM to find more of the expected factors through feedback, as supported by its higher MB column.
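The three columns can be tallied from a proposal as follows, reusing the toy variable names from the sketch above and assuming each proposed factor has already been matched to a ground-truth variable up to semantics.

MARKOV_BLANKET = {"P1", "P2", "P3", "C", "S"}    # parents, child, spouse of Y
IN_GRAPH_NOT_MB = {"D"}                          # the disturbing factor

def tally(proposed):
    mb = sum(f in MARKOV_BLANKET for f in proposed)
    nmb = sum(f in IN_GRAPH_NOT_MB for f in proposed)
    ot = len(proposed) - mb - nmb                # everything else
    return mb, nmb, ot

print(tally(["P1", "P2", "D", "weather"]))       # -> (2, 1, 1)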


Table 1. Causal discovery results in AppleGastronome. MB, NMB, and OT refer to the number of discovered factors that lie in the underlying Markov blanket, in the causal graph but outside the Markov blanket, and among the other variables, respectively. Recall, precision, and F1 for factor proposal evaluate the discovered causal ancestors; recall, precision, and F1 for relation extraction evaluate the recovery of the causal edges. The Data baseline refers to pairwise causal relation inference (Kiciman et al., 2023) based on the factors discovered by the Data baseline.


Can LLMs be an effective factor proposer? As discussed in Sec. 2.4 of the paper, two abilities are crucial for LLMs to identify potential high-level factors: the first is to be aware of the existence of potential factors, and the second is to synthesize and describe these factors. Inspired by this observation, we propose two novel metrics to quantify the causal abilities of LLMs: (a) Perception, which quantifies the ratio of valid factors (satisfying Prop. 2.1) proposed by the LLM in each round; and (b) Capacity, which measures the effective mutual information drop in Assumption 2.2. As shown in Fig. 2(c), LLMs differ largely in the perception score while performing comparably in the capacity score.
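Both metrics admit a compact sketch. This is our minimal reading of the definitions (the precise forms are in the paper), with is_valid standing in for the Prop. 2.1 check and the entropies assumed to be estimated on the annotated data (estimators not shown).

def perception(proposals_per_round, is_valid):
    """Per-round ratio of proposed factors that are valid (Prop. 2.1)."""
    return [sum(map(is_valid, r)) / len(r) for r in proposals_per_round if r]

def capacity(h_target_before, h_target_after):
    """Relative drop in the target's remaining uncertainty once the newly
    proposed factors are conditioned on (our reading of Assumption 2.2)."""
    return (h_target_before - h_target_after) / h_target_before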


Figure 2. Quantitative evaluation of the causal capabilities of LLMs in COAT.


Can LLMs be an effective factor annotator? Since LLMs are also used to annotate the data according to the proposed annotation guidelines, we analyze the capabilities of LLMs in terms of annotation accuracy. As shown in Fig. 2(a-b), both GPT-3.5 and GPT-4 annotate the objective attributes well, and their performance on the subjective human preferences remains relatively high. In addition, since the annotations by LLMs may introduce additional noise, or even additional confounders, we also conduct independence tests between the annotation noises and the features. We find that, with highly capable LLMs, e.g., GPT-4-Turbo, the dependencies can be kept at an acceptable level.
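One way to run such checks is with the kernel-based (KCI) independence test shipped with causal-learn, sketched below; the toy arrays stand in for the real LLM annotations and gold values, and noise-vs-noise pairs can be tested the same way.

import numpy as np
from causallearn.utils.cit import CIT

# Toy stand-ins: replace with the real annotations and gold factor values.
rng = np.random.default_rng(0)
gold = rng.integers(0, 3, size=(200, 5)).astype(float)  # ground-truth values
annotated = gold + rng.integers(-1, 2, size=(200, 5))   # noisy LLM annotations
noise = annotated - gold                                # annotation noise

table = np.hstack([noise, gold])
kci = CIT(table, "kci")                 # kernel-based independence test
k = noise.shape[1]
for i in range(k):                      # each noise column ...
    for j in range(k, table.shape[1]):  # ... against each annotated feature
        print(f"noise[{i}] vs feature[{j - k}]: p = {kci(i, j):.3f}")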


Table 2. Independence tests of the annotation noises with annotated features and other noises in AppleGastronome.

Can COAT reliably recover the causal relationships? We present qualitative results in Fig. 3. Compared to directly prompting LLMs to reason about the causal relations, COAT significantly boosts causal relation recovery. Meanwhile, COAT maintains high performance across various LLMs, which further demonstrates the effectiveness of the causal feedback in improving the robustness of the system. In fact, the causal feedback focuses on making maximal use of the rich knowledge of LLMs while reducing the reliance on the reasoning capabilities of individual LLMs to assist with causal discovery. We provide the full results of 10 LLMs in Appendix E.3.
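The relation-extraction scores reported in Table 1 can be computed by comparing the discovered and ground-truth edge sets; below is one plausible evaluation, sketched with the toy variable names from above.

def edge_f1(pred_edges, true_edges):
    pred, true = set(pred_edges), set(true_edges)
    tp = len(pred & true)                        # correctly recovered edges
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

print(edge_f1([("P1", "Y"), ("D", "Y")],
              [("P1", "Y"), ("P2", "Y"), ("P3", "Y"), ("Y", "C")]))
# -> (0.5, 0.25, 0.333...)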


Figure 3. The discovered causal graphs in AppleGastronome.

Results on Neuropathic Benchmark

In the original dataset, there are three levels of causal variables: the symptom level, the radiculopathy level, and the pathophysiology level. In this project, we mainly consider the target variable of right shoulder impingement. When generating the clinical diagnosis notes as x using GPT-4, we avoid mentioning any variables other than symptoms.
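A hedged sketch of how such symptom-only notes could be prompted; the symptom names and the wording are illustrative, not the paper's actual prompt.

SYMPTOMS = ["right shoulder pain", "neck tingling"]  # illustrative subset

def note_prompt(present_symptoms):
    return (
        "Write a short clinical diagnosis note for a patient.\n"
        f"Symptoms present: {', '.join(present_symptoms) or 'none'}.\n"
        "Describe only the symptoms; do NOT mention any radiculopathy- or "
        "pathophysiology-level condition."
    )

print(note_prompt(["right shoulder pain"]))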


Box 2. Examples of Neuropathic data, grouped by the presence of a certain symptom.


Factor proposal. Similar to AppleGastronome, COAT consistently outperforms all of the baselines regardless of which LLM is incorporated. In particular, COAT can boost the weakest backbone, LLaMA2-7b, to outperform any other LLM.


Table 3. Causal discovery results in Neuropathic. PA, AN, and OT refer to the parents, ancestors, and others, respectively. Accuracy and F1 measure the recovery of the causal ancestors.


Causal relation recovery. Due to the faithfulness issue of the original dataset (Tu et al., 2019), we mainly conduct a qualitative comparison between the part of the ground truth that is faithful to the data and the graphs recovered by the baselines and COAT.


Figure 4. The discovered causal graphs in Neuropathic.

ENSO Case Study


Figure 5. The final causal graph found by COAT in the ENSO case study.


El Niño–Southern Oscillation (ENSO) Case Study. ENSO is a climatic phenomenon in the Pacific Ocean that profoundly influences global weather patterns. To understand its mechanism, we apply COAT to the NOAA dataset. COAT identifies 13 factors, whose instantaneous causal relations are visualized in Fig. 5. The target variable is the future change in monthly sea surface temperature (SST) in the Nino3 region, an important indicator of ENSO events. Each factor is a time series of a certain climate measurement at a specific level, averaged over a specific region. The paths involving Sea Level Pressure, Momentum Flux, and Cloud Cover match the existing understanding in the literature. The graph also suggests several interesting hypotheses that are less explored, such as the path from Soil Temperature in the South American coastal region. We refer readers to Appendix K of the paper for details.
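To make the factor construction concrete, here is an illustrative snippet with xarray; the file name, variable names, and coordinate conventions are placeholders for the actual NOAA data, not the paper's pipeline.

import xarray as xr

ds = xr.open_dataset("noaa_monthly.nc")  # hypothetical monthly dataset

# Target: next-month change of SST averaged over the Nino3 box
# (5S-5N, 150W-90W; longitudes expressed on a 0-360 grid).
nino3 = ds["sst"].sel(lat=slice(-5, 5), lon=slice(210, 270)).mean(["lat", "lon"])
target = nino3.shift(time=-1) - nino3

# Example factor: sea level pressure averaged over a candidate region.
slp_factor = ds["slp"].sel(lat=slice(-10, 10), lon=slice(130, 180)).mean(["lat", "lon"])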

Contact

Please check our paper for more details of this research work. For any questions, please feel free to contact us.

If you find our paper and repo useful, please consider citing:

@inproceedings{causalcoat2024,
    title={Discovery of the Hidden World with Large Language Models}, 
    author={Chenxi Liu and Yongqiang Chen and Tongliang Liu and Mingming Gong and James Cheng and Bo Han and Kun Zhang},
    year={2024},
    booktitle={Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems}
}