Felix is a flexible text-editing approach for generation, designed to derive maximum benefit from the ideas of decoding with bi-directional contexts and self-supervised pretraining. We achieve this by decomposing the text-editing task into two sub-tasks: tagging, which decides on the subset of input tokens to keep and their order in the output text, and insertion, which in-fills the output tokens not present in the input.
A detailed method description and evaluation can be found in our EMNLP 2020 Findings paper: https://www.aclweb.org/anthology/2020.findings-emnlp.111/
Felix is built on Python 3, TensorFlow 2, and BERT. It works with CPU, GPU, and Cloud TPU.
Running an experiment with Felix consists of the following steps:
- Create the label map for the tagging model.
- Convert the data for the insertion/tagging models.
- Fine-tune the tagging and insertion models.
- Compute predictions.
Next we go through these steps, using a subset of the DiscoFuse dataset as a running example.
You can run all of the steps with
sh run_discofuse_experiment.sh
after setting the variables at the beginning of the script.
# Label map construction
export OUTPUT_DIR=/path/to/output
python phrase_vocabulary_constructor_main.py \
--output="${OUTPUT_DIR}/label_map.json" \
--use_pointing="${USE_POINTING}" \
--do_lower_case="True"
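As a rough sketch of what this step produces (the exact tag set and schema are determined by phrase_vocabulary_constructor_main and may differ), the label map pairs edit tags with integer ids:

```python
import json

# Illustrative only: the real label_map.json is produced by
# phrase_vocabulary_constructor_main, and its exact contents may differ.
label_map = {
    "PAD": 0,
    "KEEP": 1,
    "DELETE": 2,
    "KEEP|1": 3,  # keep the token and insert one masked slot after it
    "KEEP|2": 4,  # keep the token and insert two masked slots after it
}
print(json.dumps(label_map, indent=2))
```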
Download a pretrained BERT model from the official repository. We've used the 12-layer ''BERT-Base'' model (not the pretrained hub modules) for all of our experiments, unless otherwise stated. Then convert the original TSV datasets into TFRecord format. The DiscoFuse dataset can be found here: https://github.com/google-research-datasets/discofuse
# Preprocess
export BERT_BASE_DIR=/path/to/uncased_L-12_H-768_A-12
export DISCOFUSE_DIR=/path/to/discofuse
python preprocess_main.py \
--input_file="${DISCOFUSE_DIR}/train.tsv" \
--input_format="discofuse" \
--output_file="${OUTPUT_DIR}/train.tfrecord" \
--label_map_file="${OUTPUT_DIR}/label_map.json" \
--vocab_file="${BERT_BASE_DIR}/vocab.txt" \
--do_lower_case="True" \
--use_open_vocab="True" \
--max_seq_length="128" \
--use_pointing="${USE_POINTING}" \
--split_on_punc="True"
python preprocess_main.py \
--input_file="${DISCOFUSE_DIR}/tune.tsv" \
--input_format="discofuse" \
--output_file="${OUTPUT_DIR}/tune.tfrecord" \
--label_map_file="${OUTPUT_DIR}/label_map.json" \
--vocab_file="${BERT_BASE_DIR}/vocab.txt" \
--do_lower_case="True" \
--use_open_vocab="True" \
--max_seq_length="128" \
--use_pointing="${USE_POINTING}" \
--split_on_punc="True"
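To give an intuition for what preprocessing computes, here is a toy sketch (not the repository's implementation) of the tagging sub-task: source tokens that survive into the target are labeled KEEP and the rest DELETE, using a longest-matching-blocks alignment.

```python
import difflib

def tag_tokens(source, target):
    """Toy tagging sketch: KEEP source tokens aligned to the target, DELETE the rest."""
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    tags = ["DELETE"] * len(source)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            tags[i] = "KEEP"
    return tags

# A DiscoFuse-style sentence fusion example.
src = "turing was born in 1912 . he died in 1954 .".split()
tgt = "turing was born in 1912 and died in 1954 .".split()
print(tag_tokens(src, tgt))
```

The real converter additionally decides how many masked slots to insert and, with pointing enabled, where each kept token should move.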
Model hyperparameters are specified in felix_config.json. This configuration file extends bert_config.json, which comes with the zipped pretrained BERT model.
Note: the tagging and insertion models can be trained independently, so it is quicker to train them in parallel rather than sequentially.
Train the models on CPU/GPU.
# Train
python run_felix.py \
--train_file="${OUTPUT_DIR}/train.tfrecord" \
--eval_file="${OUTPUT_DIR}/tune.tfrecord" \
--model_dir_tagging="${OUTPUT_DIR}/model_tagging" \
--bert_config_tagging="${DISCOFUSE_DIR}/felix_config.json" \
--max_seq_length=128 \
--num_train_epochs=500 \
--num_train_examples=8 \
--num_eval_examples=8 \
--train_batch_size="32" \
--eval_batch_size="32" \
--log_steps="100" \
--steps_per_loop="100" \
--train_insertion="False" \
--use_pointing="${USE_POINTING}" \
--init_checkpoint="${BERT_BASE_DIR}/bert_model.ckpt" \
--learning_rate="0.00003" \
--pointing_weight="1" \
--input_format="recordio" \
--use_weighted_labels="True"
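The pointing mechanism (enabled via --use_pointing) lets the tagger reorder the kept tokens. As a minimal toy of the idea, not the repository's implementation: each position points at the token that follows it in the output, and decoding walks the pointers starting from [CLS], with 0 ending the walk.

```python
def reorder(tokens, points):
    """Toy pointing sketch: follow next-token pointers from [CLS]; 0 ends the walk."""
    out, i = [], 0
    while points[i] != 0:
        i = points[i]
        out.append(tokens[i])
    return out

# Kept tokens in source order; the pointers recover the output order.
tokens = ["[CLS]", "turing", "died", "born", "was"]
points = [1, 4, 0, 2, 3]
print(reorder(tokens, points))
```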
rm -rf "${OUTPUT_DIR}/model_insertion"
mkdir "${OUTPUT_DIR}/model_insertion"
python run_felix.py \
--train_file="${OUTPUT_DIR}/train.tfrecord.ins" \
--eval_file="${OUTPUT_DIR}/tune.tfrecord.ins" \
--model_dir_insertion="${OUTPUT_DIR}/model_insertion" \
--bert_config_insertion="${DISCOFUSE_DIR}/felix_config.json" \
--max_seq_length=128 \
--num_train_epochs=500 \
--num_train_examples=8 \
--num_eval_examples=8 \
--train_batch_size="32" \
--eval_batch_size="32" \
--log_steps="100" \
--steps_per_loop="100" \
--init_checkpoint="${BERT_BASE_DIR}/bert_model.ckpt" \
--use_pointing="${USE_POINTING}" \
--learning_rate="0.00003" \
--pointing_weight="1" \
--input_format="recordio" \
--train_insertion="True"
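Conceptually, the insertion model fills the masked slots that the tagger leaves in its output template. A toy sketch (with a fixed list standing in for the masked-LM predictions):

```python
# Toy sketch of the insertion sub-task: the tagger emits a template with
# [MASK] slots; the insertion model predicts the missing tokens.
template = ["turing", "was", "born", "in", "1912", "[MASK]",
            "died", "in", "1954", "."]
predictions = iter(["and"])  # stand-in for the insertion model's output
filled = [next(predictions) if t == "[MASK]" else t for t in template]
print(" ".join(filled))
```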
To train on Cloud TPU, you should additionally set:
--use_tpu=true \
--tpu_name=${TPU_NAME}
Please see BERT TPU instructions and the Google Cloud TPU tutorial for how to use Cloud TPUs.
# Predict
export PREDICTION_FILE=${OUTPUT_DIR}/pred.tsv
python predict_main.py \
--input_format="discofuse" \
--predict_input_file="${DISCOFUSE_DIR}/test.tsv" \
--predict_output_file="${PREDICTION_FILE}" \
--label_map_file="${OUTPUT_DIR}/label_map.json" \
--vocab_file="${BERT_BASE_DIR}/vocab.txt" \
--max_seq_length=128 \
--predict_batch_size=32 \
--do_lower_case="True" \
--use_open_vocab="True" \
--bert_config_tagging="${DISCOFUSE_DIR}/felix_config.json" \
--bert_config_insertion="${DISCOFUSE_DIR}/felix_config.json" \
--model_tagging_filepath="${OUTPUT_DIR}/model_tagging" \
--model_insertion_filepath="${OUTPUT_DIR}/model_insertion" \
--use_pointing="${USE_POINTING}"
To predict on Cloud TPU, you should additionally set:
--use_tpu=true \
--tpu_name=${TPU_NAME}
The prediction step outputs a TSV file with four columns: the source, the input to the insertion model, the final output, and the reference. Note that the Felix output is tokenized into WordPieces, including a start token "[CLS]" and an end token "[SEP]". WordPieces can be removed by replacing " ##" with "". Additionally, words have been split on punctuation ("don't" -> "don ' t"); this must also be reversed.
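Based on the description above, a simple post-processing helper might look like this (the punctuation re-joining here only handles the apostrophe case; a full detokenizer would need more rules):

```python
def detokenize(prediction):
    """Undo Felix's WordPiece tokenization as described above."""
    # Merge WordPieces: "sto ##p" -> "stop".
    text = prediction.replace(" ##", "")
    # Strip the start/end markers.
    text = text.replace("[CLS]", "").replace("[SEP]", "").strip()
    # Re-join words split on punctuation: "don ' t" -> "don't".
    # (Heuristic for apostrophes only; other punctuation needs more cases.)
    text = text.replace(" ' ", "'")
    return text

print(detokenize("[CLS] don ' t sto ##p me now [SEP]"))
```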
@inproceedings{mallinson-etal-2020-felix,
    title = "{FELIX}: Flexible Text Editing Through Tagging and Insertion",
    author = "Mallinson, Jonathan  and
      Severyn, Aliaksei  and
      Malmi, Eric  and
      Garrido, Guillermo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.111",
    doi = "10.18653/v1/2020.findings-emnlp.111",
    pages = "1244--1255",
}
Apache 2.0; see LICENSE for details.
This repository contains a TensorFlow 2 reimplementation of our original TensorFlow 1 code used for the paper, so some discrepancies compared to the paper's results are possible. However, we've verified that we get similar results on the DiscoFuse dataset.
This is not an official Google product.