diff --git a/NLP_classification/Text classification with TensorFlow Hub: Movie reviews.md b/NLP_classification/Text classification with TensorFlow Hub: Movie reviews.md
new file mode 100644
index 0000000000000000000000000000000000000000..a78f3390e89aac9b8ba7da86afac518aaca4b778
--- /dev/null
+++ b/NLP_classification/Text classification with TensorFlow Hub: Movie reviews.md
@@ -0,0 +1,136 @@
+This notebook classifies movie reviews as *positive* or *negative* using the text of the review. This is an example of *binary*—or two-class—classification, an important and widely applicable kind of machine learning problem.
+
+The tutorial demonstrates the basic application of transfer learning with [TensorFlow Hub](https://tfhub.dev/) and Keras.
+
+It uses the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are *balanced*, meaning they contain an equal number of positive and negative reviews.
+
+This notebook uses [`tf.keras`](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow, and [`tensorflow_hub`](https://www.tensorflow.org/hub), a library for loading trained models from [TFHub](https://tfhub.dev/) in a single line of code. For a more advanced text classification tutorial using [`tf.keras`](https://www.tensorflow.org/api_docs/python/tf/keras), see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).
+
+```
+import os
+import numpy as np
+
+import tensorflow as tf
+import tensorflow_hub as hub
+import tensorflow_datasets as tfds
+
+print("Version: ", tf.__version__)
+print("Eager mode: ", tf.executing_eagerly())
+print("Hub version: ", hub.__version__)
+print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")
+```
+
+## Download the IMDB dataset
+
+The IMDB dataset is available as [`imdb_reviews`](https://www.tensorflow.org/datasets/catalog/imdb_reviews) on [TensorFlow Datasets](https://www.tensorflow.org/datasets). The following code downloads the IMDB dataset to your machine (or the Colab runtime):
+
+```
+# Split the training set into 60% and 40% to end up with 15,000 examples
+# for training, 10,000 examples for validation and 25,000 examples for testing.
+train_data, validation_data, test_data = tfds.load(
+    name="imdb_reviews",
+    split=('train[:60%]', 'train[60%:]', 'test'),
+    as_supervised=True)
+```
+
+## Explore the data
+
+Let's take a moment to understand the format of the data. Each example is a sentence representing the movie review and a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review and 1 is a positive review.
+Let's print the first 10 examples:
+
+```
+train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
+train_examples_batch
+```
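+A quick, optional sketch (not part of the original notebook) that prints each of the 10 review snippets next to its label; it only uses the `train_examples_batch` and `train_labels_batch` tensors produced by the cell above:
+
+```
+# Hypothetical helper cell: pair the first 10 review snippets with their labels.
+for text, label in zip(train_examples_batch.numpy(), train_labels_batch.numpy()):
+    sentiment = "positive" if label == 1 else "negative"
+    print(f"[{sentiment}] {text[:80].decode('utf-8', errors='ignore')}...")
+```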
+## Build the model
+
+The neural network is created by stacking layers—this requires three main architectural decisions:
+
+- How to represent the text?
+- How many layers to use in the model?
+- How many *hidden units* to use for each layer?
+
+In this example, the input data consists of sentences. The labels to predict are either 0 or 1.
+
+One way to represent the text is to convert sentences into embedding vectors. Use a pre-trained text embedding as the first layer, which has three advantages:
+
+- You don't have to worry about text preprocessing.
+- You benefit from transfer learning.
+- The embedding has a fixed size, so it's simpler to process.
+
+For this example you use a **pre-trained text embedding model** from [TensorFlow Hub](https://tfhub.dev/) called [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2).
+
+There are many other pre-trained text embeddings from TFHub that can be used in this tutorial:
+
+- [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2) - trained with the same NNLM architecture on the same data as [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), but with a larger embedding dimension. Larger embeddings can improve performance on your task, but they may take longer to train.
+- [google/nnlm-en-dim128-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) - the same as [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2), but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.
+- [google/universal-sentence-encoder/4](https://tfhub.dev/google/universal-sentence-encoder/4) - a much larger model yielding 512-dimensional embeddings, trained with a deep averaging network (DAN) encoder.
+
+And many more! Find more [text embedding models](https://tfhub.dev/s?module-type=text-embedding) on TFHub.
+
+Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is `(num_examples, embedding_dimension)`.
+
+```
+embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
+hub_layer = hub.KerasLayer(embedding, input_shape=[],
+                           dtype=tf.string, trainable=True)
+hub_layer(train_examples_batch[:3])
+```
+
+Let's now build the full model:
+
+```
+model = tf.keras.Sequential()
+model.add(hub_layer)
+model.add(tf.keras.layers.Dense(16, activation='relu'))
+model.add(tf.keras.layers.Dense(1))
+
+model.summary()
+```
+
+The layers are stacked sequentially to build the classifier:
+
+1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The pre-trained text embedding model that you are using ([google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2)) splits the sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are `(num_examples, embedding_dimension)`. For this NNLM model, the `embedding_dimension` is 50.
+2. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.
+3. The last layer is densely connected with a single output node.
+
+Let's compile the model.
+
+### Loss function and optimizer
+
+A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), you'll use the `binary_crossentropy` loss function.
+
+This isn't the only choice for a loss function; you could, for instance, choose `mean_squared_error`. But, generally, `binary_crossentropy` is better for dealing with probabilities—it measures the "distance" between probability distributions, or in our case, between the ground-truth distribution and the predictions. The short sketch below illustrates what `from_logits=True` means in practice.
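+As a side note (not part of the original tutorial), the difference between passing logits and passing probabilities to the loss can be checked directly. With `from_logits=True`, Keras applies the sigmoid internally; the result matches applying `tf.nn.sigmoid` yourself and using the default loss:
+
+```
+# Hypothetical check: both calls should print the same loss value.
+labels = tf.constant([[1.0], [0.0]])
+logits = tf.constant([[2.0], [-1.0]])
+
+loss_from_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
+loss_from_probs = tf.keras.losses.BinaryCrossentropy()
+
+print(loss_from_logits(labels, logits).numpy())
+print(loss_from_probs(labels, tf.nn.sigmoid(logits)).numpy())
+```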
+
+Later, when you are exploring regression problems (say, to predict the price of a house), you'll see how to use another loss function called mean squared error.
+
+Now, configure the model to use an optimizer and a loss function:
+
+```
+model.compile(optimizer='adam',
+              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
+              metrics=['accuracy'])
+```
+
+## Train the model
+
+Train the model for 10 epochs in mini-batches of 512 samples. This is 10 iterations over all the samples in `train_data`. While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:
+
+```
+history = model.fit(train_data.shuffle(10000).batch(512),
+                    epochs=10,
+                    validation_data=validation_data.batch(512),
+                    verbose=1)
+```
+
+## Evaluate the model
+
+Let's see how the model performs. Two values will be returned: loss (a number that represents the error; lower values are better) and accuracy.
+
+```
+results = model.evaluate(test_data.batch(512), verbose=2)
+
+for name, value in zip(model.metrics_names, results):
+  print("%s: %.3f" % (name, value))
+```
\ No newline at end of file
diff --git "a/NLP_recommend/\345\270\270\347\224\250\345\267\245\345\205\267\346\216\250\350\215\220.md" "b/NLP_recommend/\345\270\270\347\224\250\345\267\245\345\205\267\346\216\250\350\215\220.md"
new file mode 100644
index 0000000000000000000000000000000000000000..f67a35fbffc864e75c5d3af2ca0678da5534f388
--- /dev/null
+++ "b/NLP_recommend/\345\270\270\347\224\250\345\267\245\345\205\267\346\216\250\350\215\220.md"
@@ -0,0 +1,47 @@
+## TensorFlow Recommenders
+
+TensorFlow Recommenders (TFRS) is a library for building recommender system models.
+
+It helps with the full workflow of building a recommender system: data preparation, model formulation, training, evaluation, and deployment.
+
+It's built on Keras and aims to have a gentle learning curve while still giving you the flexibility to build complex models.
+
+TFRS makes it possible to:
+
+- Build and evaluate flexible recommendation retrieval models.
+- Freely incorporate item, user, and context information into recommendation models.
+- Train multi-task models that jointly optimize multiple recommendation objectives.
+
+```
+import numpy as np
+import tensorflow as tf
+import tensorflow_datasets as tfds
+import tensorflow_recommenders as tfrs
+
+# Load data on movie ratings.
+ratings = tfds.load("movielens/100k-ratings", split="train")
+movies = tfds.load("movielens/100k-movies", split="train")
+
+# Build flexible representation models ([...] is a placeholder for your own layers).
+user_model = tf.keras.Sequential([...])
+movie_model = tf.keras.Sequential([...])
+
+# Define your objectives.
+task = tfrs.tasks.Retrieval(metrics=tfrs.metrics.FactorizedTopK(
+    movies.batch(128).map(movie_model)
+  )
+)
+
+# Create a retrieval model. MovielensModel is a user-defined tfrs.Model subclass
+# (a minimal sketch of it is given after this snippet).
+model = MovielensModel(user_model, movie_model, task)
+model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))
+
+# Train.
+model.fit(ratings.batch(4096), epochs=3)
+
+# Set up retrieval using trained representations.
+index = tfrs.layers.ann.BruteForce(model.user_model)
+index.index(movies.batch(100).map(model.movie_model), movies)
+
+# Get recommendations.
+_, titles = index(np.array(["42"]))
+print(f"Recommendations for user 42: {titles[0, :3]}")
+```
\ No newline at end of file
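The snippet above calls a `MovielensModel` class that the file never defines. For context, here is a minimal sketch of what such a class typically looks like in the TFRS two-tower retrieval pattern: a `tfrs.Model` subclass whose `compute_loss` feeds user and movie embeddings into the retrieval task. The feature names `user_id` and `movie_title` match the `movielens/100k-ratings` dataset; everything else is an illustrative assumption, not code from this repository.

```
import tensorflow as tf
import tensorflow_recommenders as tfrs

class MovielensModel(tfrs.Model):
  """Two-tower retrieval model: a user tower, a movie tower, and a retrieval task."""

  def __init__(self, user_model, movie_model, task):
    super().__init__()
    self.user_model = user_model    # maps user ids to embeddings
    self.movie_model = movie_model  # maps movie titles to embeddings
    self.task = task                # the tfrs.tasks.Retrieval object defined earlier

  def compute_loss(self, features, training=False):
    # "user_id" and "movie_title" are feature names in movielens/100k-ratings.
    user_embeddings = self.user_model(features["user_id"])
    movie_embeddings = self.movie_model(features["movie_title"])
    # The retrieval task computes the loss (and the FactorizedTopK metrics).
    return self.task(user_embeddings, movie_embeddings)
```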