Text-Mining Differences in Chinese Newspaper Articles
Under Development!
This tutorial is not in its final state. The content may change a lot in the next months. Because of this status, it is also not listed in the topic pages.
Overview
Questions:
How can I automatically compare two Chinese newspaper articles?
What characters were censored in a Chinese newspaper published in Hong Kong in the 1930s?
Objectives:
Learn to clean and compare two texts.
Extract specific information out of texts.
Visualise your results in a word cloud.
Time estimation: 1 hour
Level: Introductory
Published: Feb 28, 2025
Last modification: Feb 28, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
Revision: 1
The British Hong Kong Government censored Chinese newspapers before their publication in the colony in the 1930s (Ng 2022). The redactions were marked with replacement characters like ×, making them visible even to readers who did not know any Chinese.
The schematic example, adapted from Schneider 2024, shows what such a censored Chinese article looked like. It is read from right to left and top to bottom. The two more prominent lines on the right are the article title, and the following text is the article’s main body. It contains the character × several times, indicating various instances where it was censored.
Despite this obvious form of censorship, no research has examined which Chinese characters the × replaced. My dissertation (Schneider 2024), which informs this workflow, started at this point. Through extensive research, I found uncensored versions of several articles that were censored in the Hong Kong edition of Da gong bao (大公報). These mostly came from editions printed in mainland China, where different censorship regulations applied. The articles from China were not censored and openly showed the characters redacted in the Hong Kong versions. An example of a censored article is Anon. 1938, and an example of the uncensored version is Anon. 1938.
The tutorial uses text mining to compare censored and uncensored text and to answer the following question: What characters were censored in a Chinese newspaper published in Hong Kong in the 1930s?
Agenda
In this tutorial, we will cover:
Uploading the Texts
The machine-readable versions of the Chinese newspaper articles I originally used in my dissertation come from a proprietary database and cannot be shared here. Instead, I generated a dummy article with a similar setup in GPT and manually adapted the censorship symbols based on my research. The articles differ in style and punctuation, as is consistent with the articles in my research. The input files are two texts in traditional Chinese. The first is censored and contains ×. The second is uncensored and does not contain replacement symbols. The two texts differ slightly in their layout, which will be unified later.
Hands On: Upload the Texts
Create a new history for this tutorial
To create a new history simply click the new-history icon at the top of the history panel:
Import the files from Zenodo
https://zenodo.org/records/14899614/files/Example_Chinese_newspaper_censored.txt
https://zenodo.org/records/14899614/files/Example_Chinese_newspaper_uncensored.txt
- Copy the link location
- Click Upload Data at the top of the tool panel
- Select Paste/Fetch Data
- Paste the link(s) into the text field
- Press Start
- Close the window
- Rename the datasets
Check that the datatype
- Click on the pencil icon for the dataset to edit its attributes
- In the central panel, click the Datatypes tab on the top
- In Assign Datatype, select datatypes from the “New type” dropdown
- Tip: you can start typing the datatype into the field to filter the dropdown menu
- Click the Save button
Question: What do the uploaded texts look like?
- Name at least one visual difference you notice between the two texts. (No need to understand their content here.)
- Visual differences you notice might be:
- The texts have different headlines
- The uncensored text contains more paragraphs
- The censored text contains the symbol × several times
- The censored text contains additional symbols at the end
- The texts use slightly different punctuation (impressive catch!)
Pre-processing
We pre-process and clean both texts to make the comparison easier and more apparent. This step will unify their layout. It uses Regular Expressions (Regex) to find and replace certain text parts. Here, the Regex also helps to restructure the text.
Regular Expressions are a powerful tool to modify text automatically based on word patterns. To read more about Regex’s specifics, see its documentation. You can check what your input matches in Regex 101. Enter the Regular Expression you want to try in the top field. Insert a sample text below to see what content your Regex catches and to find out how to adapt it.
Clean up both texts
We will use Regular Expressions in a tool called “Replace text”. It contains four different sub-steps. Those will vary if you upload different texts. Apply this step first to the censored and then to the uncensored text to get two cleaned ones.
Hands On: Cleaning the Text with Regular Expressions
- Replace Text (Galaxy version 9.3+galaxy1) with the following parameters:
  - “File to process”: output (Input dataset)
  - In “Replacement”:
    - “Insert Replacement”
      - “Find pattern”: \r
    - “Insert Replacement”
      - “Find pattern”: \n
    - “Insert Replacement”
      - “Find pattern”: \s
    - “Insert Replacement”
      - “Find pattern”: (.)
      - “Replace with:”: \1\n
Comment: Explaining the above Regular Expressions
Regular expressions can do more than find particular words, as you might know from regular text editors. They can also find particular patterns, for example, only capitalised words or all numbers. In this step, we mostly delete unnecessary placeholders. The first pattern we want to find is \r. It catches a specific form of invisible linebreak (the carriage return) that would create unwanted gaps in the comparison later. We delete those by leaving the optional “Replace with” field blank. Similarly, \n marks linebreaks. We also delete those by leaving the optional “Replace with” field blank. The next expression we search for is \s. It matches whitespace, such as the spaces you see between words on your computer. We delete those as well. As a result, there are no gaps in our text anymore. In the last step, we select each character with (.) and reformat it. We want to have one character per line. Therefore, we replace each character with \1\n. \1 stands for the captured character, and \n adds a linebreak after it. The result is a clean and reformatted text.
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression    Matches
abc                   an occurrence of abc within your data
(abc|def)             abc or def
[abc]                 a single character which is either a, b, or c
[^abc]                a character that is NOT a, b, nor c
[a-z]                 any lowercase letter
[a-zA-Z]              any letter (upper or lower case)
[0-9]                 numbers 0-9
\d                    any digit (same as [0-9])
\D                    any non-digit character
\w                    any alphanumeric character
\W                    any non-alphanumeric character
\s                    any whitespace
\S                    any non-whitespace character
.                     any character
\.                    a literal period
{x,y}                 between x and y repetitions
^                     the beginning of the line
$                     the end of the line
Note: characters such as *, ?, ., + etc. have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.
Examples
Regular expression    Matches
\d{4}                 4 digits (e.g. a year)
chr\d{1,2}            chr followed by 1 or 2 digits
.*abc$                anything with abc at the end of the line
^$                    empty line
^>.*                  line starting with > (e.g. Fasta header)
^[^>].*               line not starting with > (e.g. Fasta sequence)
Replacing
Sometimes you need to capture the exact value you matched on in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2 etc. for the first and second captured values. If you want to refer to the whole match, use &.
Regular expression      Input           Captures
chr(\d{1,2})            chr14           \1 = 14
(\d{2}) July (\d{4})    24 July 1984    \1 = 24, \2 = 1984
An expression like s/find/replacement/g indicates a replacement expression. The s stands for substitute: every occurrence of find is replaced with replacement. The expression does this globally (g), which means it doesn’t stop after the first match.
Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.
You can also use replacement modifiers such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
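The capture-group examples above can also be tried out in Python, whose re module uses the same \1 and \2 back-references in replacements. This is only a sketch for experimenting: Python is not what Galaxy's Replace Text tool runs, the input strings are made up, and the \L and \U modifiers mentioned above are not available in Python's re module.

```python
import re

# \1 refers to the first capture group, here the chromosome number.
renamed = re.sub(r"chr(\d{1,2})", r"CHR\1", "chr14 and chr2")

# Two capture groups: the day is group 1, the year is group 2.
match = re.search(r"(\d{2}) July (\d{4})", "24 July 1984")
day, year = match.group(1), match.group(2)

print(renamed)
print(day, year)
```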
Remember to apply those steps to both the censored and the uncensored text. Rename both new texts in a meaningful way. Choose, for example, “Cleaned Censored Text” and “Cleaned Uncensored Text”.
Question: Take a look at the texts
- What do the texts look like now? How have they changed?
- Before, the texts showed many sentences in one line. Now, both texts show one character per line and have more lines. The layout is entirely different.
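For readers who want to see the four cleaning replacements outside of Galaxy, they can be sketched in Python. This is a minimal illustration, not the Replace Text tool itself: the function name clean_text and the sample string are made up, and Python's regex flavour may differ in details from the one Galaxy uses.

```python
import re


def clean_text(text: str) -> str:
    """Apply the four find-and-replace steps in the same order as above."""
    text = re.sub(r"\r", "", text)        # delete carriage returns
    text = re.sub(r"\n", "", text)        # delete linebreaks
    text = re.sub(r"\s", "", text)        # delete any remaining whitespace
    return re.sub(r"(.)", r"\1\n", text)  # put each character on its own line


# Made-up sample, not taken from the tutorial files.
sample = "敵軍 進攻\r\n香港"
print(clean_text(sample))
```

Each character of the sample ends up on its own line, which is the layout the comparison in the next section relies on.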
Comparing the censored and uncensored text
We can now compare the two cleaned texts. The comparison visualises the differences between them and marks them by colour. Select the cleaned censored text, which contains replacement characters like ‘×’, as the first input and the cleaned uncensored text as the second input. With the HTML report format, the tool creates an HTML file that colour-codes differences as additions (green) or deletions (red).
For Researchers
Hands On: Comparing the texts using the diff tool
- diff (Galaxy version 3.10+galaxy0) with the following parameters:
  - “First input file”: outfile (output of Replace Text tool)
  - “Second input file”: outfile (output of Replace Text tool)
  - “Choose a report format”: Generates an HTML report to visualize the differences
Question: Take a look at the HTML file
- What can you see in line 20 / 23?
- Line 20 on the left side of the table shows a comma in red. It corresponds to line 23 on the right side, which contains a colon in green. This means the punctuation differs between the files. The censored version contains a comma, while the uncensored one contains a colon.
The HTML file could look like this:

It shows what passages differ in the two texts. Red parts show deletions and green-coloured areas are additions. This output is very convenient for researchers, as it shows differences quickly. However, it is not helpful for further processing with Galaxy. For this, we run this tool a second time with slightly changed parameters. The output is the basis for our further analysis.
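The idea behind the character-by-character comparison can be sketched with Python's difflib module. Note that this swaps in difflib as a stand-in for the diff tool used in Galaxy, so the output format is different. The two sample strings and the function name char_differences are made up for illustration.

```python
from difflib import SequenceMatcher


def char_differences(first: str, second: str):
    """Return (tag, text_in_first, text_in_second) for every non-matching span."""
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (tag, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up samples: a censored character and a punctuation difference.
censored = "×軍進攻，"
uncensored = "敵軍進攻："

for tag, old, new in char_differences(censored, uncensored):
    print(tag, old, new)
```

Both kinds of difference from the HTML report show up here: the replacement character × paired with the character it hides, and the differing punctuation mark.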
Create a file for further processing
This step runs the line-by-line comparison again to create a raw file that the computer can work with. It is less intuitive to read at first glance. Again, select the cleaned censored text with the replacement characters like ‘×’ as the first input and the cleaned uncensored text as the second.
Hands On: Run another diff tool
- diff (Galaxy version 3.10+galaxy0) with the following parameters:
  - “First input file”: outfile (output of Replace Text tool)
  - “Second input file”: outfile (output of Replace Text tool)
  - “Choose a report format”: Text file, side-by-side (-y)
Comment: The output format
As you can see above, this run produces a text file rather than the HTML report from the last step.
Question: Look at this step's output
- How does this file differ from the HTML file in the last step?
- This output is a raw file and shows several columns. Changes between the two texts are not coloured. Instead, they are marked by various symbols in column 8.
Select only censored lines
In the next step, we want to extract specific lines only. To determine what content was redacted in the first text, we filter the last step’s raw output for lines containing the censorship symbol ×.
Hands On: Filter text
- Filter with the following parameters:
  - “Filter”: diff_file (output of diff tool)
  - “With following condition”: ord(c1) == 215
The condition “ord(c1) == 215” selects a line if column c1, which holds the character from the censored text, contains ×. Because × is easily confused with similar-looking characters such as the letter x, the condition uses the character’s decimal Unicode code (215) instead of the symbol itself.
Comment: Filter for other characters
Add another Unicode code here if you want to select a different character, for example, ‘□’ or ‘△’. You can look up the respective code with a tool such as Character Code Finder. Copy the character you want to filter for into the “input” window and select “Decimal Character Codes” as the output. If you do this for the symbol ×, you get 215.
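The decimal codes can also be looked up with Python's built-in ord function, and the filter logic can be sketched in a few lines. The rows below are made-up (censored, uncensored) pairs for illustration, not the real diff output, and the name CENSOR_CODE is hypothetical.

```python
CENSOR_CODE = 215  # decimal code of ×, as used in the Filter condition

# Made-up (censored, uncensored) character pairs.
rows = [("×", "敵"), ("軍", "軍"), ("x", "x"), ("×", "寇")]

# Keep only rows whose first column is the censorship symbol.
censored_rows = [row for row in rows if ord(row[0]) == CENSOR_CODE]

print(ord("×"), ord("x"))  # the censorship symbol vs. the Latin letter x
print(ord("□"), ord("△"))  # codes of other possible replacement symbols
print(censored_rows)
```

The row containing the Latin letter x is not selected, which is why filtering by code is safer than filtering by a symbol that merely looks right.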
Question: What output do you get?
- How many lines contain ×?
- 13
Ensure Consistent File Format
After filtering for the censored lines, we insert a sub-step to ensure smooth computing. If a character filtered in the last step was erased, the extracted row lacks its last column, and later steps would fail with an error. This gap is not visible when you look at the file. The Compute step guards against this potential error by ensuring that all necessary columns exist.
Hands On: Compute to ensure all columns exist
- Compute (Galaxy version 2.1) with the following parameters:
  - “Input file”: out_file1 (output of Filter tool)
  - “Input has a header line with column names?”: No
  - In “Expressions”:
    - “Insert Expressions”
      - “Add expression”: c9
      - “Mode of the operation”: Replace
      - “Use new column to replace column number”: 9
  - In “Error handling”:
    - “If an expression cannot be computed for a row”: Produce an empty column value for the row
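The effect of this safeguard can be pictured with a small Python sketch. It assumes that the problem is a row with fewer than nine fields and that the fix amounts to padding it with empty values. The Compute tool may work differently internally, and the function name ensure_columns and the sample rows are made up.

```python
def ensure_columns(row: list[str], width: int = 9) -> list[str]:
    """Pad a row with empty strings until it has at least `width` fields."""
    return row + [""] * (width - len(row))


# Made-up rows: the second one is missing its ninth column.
complete = ["×", "", "", "", "", "", "", "|", "敵"]
incomplete = ["×", "", "", "", "", "", "", "<"]

print(ensure_columns(complete))
print(ensure_columns(incomplete))
```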
Summarise your findings
This step counts how often each character appears in the previous table.
Hands On: Task description
- Datamash (Galaxy version 1.8+galaxy0) with the following parameters:
  - “Input tabular dataset”: out_file1 (output of Compute tool)
  - “Group by fields”: 9
  - “Sort input”: Yes
  - In “Operation to perform on each group”:
    - “Insert Operation to perform on each group”
      - “On column”: c9
Question
- How many lines does the file have now?
- Three lines showing different characters: 寇, 敵 and 日.
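Grouping by a column and counting each group is the same idea as counting with Python's collections.Counter. The sketch below uses a short made-up list in place of column 9. It illustrates the principle and is not how Datamash is implemented.

```python
from collections import Counter

# Made-up values standing in for column 9 of the filtered table.
column_9 = ["敵", "寇", "敵", "日", "敵"]

# One entry per distinct character, with the number of rows it appears in.
counts = Counter(column_9)

for character, frequency in counts.items():
    print(character, frequency)
```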
Sort your findings
Sorting the results from the most to the least frequent character is particularly helpful if you get a long list in the last step. If you are only interested in the quantitative results, this can be your final output.
Hands On: Sort
- Sort with the following parameters:
  - “Sort Dataset”: out_file (output of Datamash tool)
  - “on column”: c2
Comment: How to sort?
Select column c2 because it contains the character frequency.
Question: Check your results
- How often did the most frequently censored character appear?
- The character 敵, meaning enemy, was censored 10 times in the first text.
Why would the British Hong Kong Government consistently censor this character? Jump to the conclusion to find out.
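Sorting on the frequency column can be sketched in Python as follows. The counts are those reported in this tutorial (the value for 寇 follows from the 13 filtered lines), and the variable names are made up.

```python
# (character, frequency) pairs, as in the two columns of the Datamash output.
counts = [("寇", 2), ("敵", 10), ("日", 1)]

# Sort by the second column (the frequency), largest value first.
ranked = sorted(counts, key=lambda pair: pair[1], reverse=True)

for character, frequency in ranked:
    print(character, frequency)
```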
Cut out the censored characters only
If you want to visualise your results, this step gets you there. We select only the uncensored characters from text two. The result is a single column of Chinese characters, one per row. This allows the word cloud in the next step to scale characters by frequency. As a result, characters that occur more often appear bigger, making the results evident at first sight.
Hands On: Select the censored characters
- Cut with the following parameters:
  - “Cut columns”: c9
  - “From”: out_file1 (output of Compute tool)
c9 means column 9. It contains the uncensored characters from text two and is, therefore, cut out in this step.
Generate a word cloud
The last step is to visualise the results in a word cloud. It shows which characters were censored in the first text. The bigger a character, the more often it was censored.
Hands On: Task description
- Generate a word cloud (Galaxy version 1.9.4+galaxy0) with the following parameters:
  - “Input file”: out_file1 (output of Cut tool)
  - “Smallest font size to use”: 8
  - “Color option”: Color
  - “Ratio of times to try horizontal fitting as opposed to vertical”: 1.0
  - “Scaling of words by frequency (0 - 1)”: 0.9
You can choose different colours to suit your needs. The closer the “Ratio of times to try horizontal fitting as opposed to vertical” is to 1, the more likely a character or word will appear horizontally. “Scaling of words by frequency (0 - 1)” controls how strongly the size of a word depends on how often it occurs. The smaller this number, the more similar in size the characters in your word cloud will be, regardless of their frequency.
Your word cloud should look similar to this:
Conclusion
This tutorial used text mining to extract censored characters from a Chinese newspaper.

The uploaded dummy texts contained several differences. They used slightly different punctuation, and some sentences and characters differed. The most obvious difference is that the second text was published uncensored in China, while the original text, published in Hong Kong, contained censorship symbols. This allowed us to extract what characters were censored in the text from Hong Kong.
Within this workflow, we first unified the layout of both texts, showing one character per line for an easier comparison with the diff tool. The tool marked the differences between both texts in colour. Afterwards, we extracted only the lines censored with ×. The results were then processed in two strands. One counted and sorted the results, and the other visualised them in a word cloud. The counts answer what characters the British Hong Kong Government censored in Chinese newspapers in the 1930s: based on the (simplified) dummy texts, the characters were 敵 (enemy), 寇 (bandit, invader) and 日 (Japan). The character for enemy dominates and was censored five times as often as the character for bandit.
What do those findings tell us? The British Hong Kong Government prevented newspapers from publishing content that took a strong stand against Japan. Why? Because the British colony Hong Kong, with its large Chinese population, is located very close to the Chinese mainland. Especially after the Japanese army invaded China in the summer of 1937, the British had to walk a tightrope. They tried to support the Chinese efforts without offending Japan. As a British outpost, Hong Kong had little military power and would not have withstood a Japanese attack for long. Therefore, the British tried to appease the Japanese Government and avoid an attack. Openly calling the Japanese bandits or the enemy would have been dangerous. The single redaction of 日 (Japan) is very uncommon. It shows that the censorship practices were adaptable and not always uniform. Censoring Hong Kong’s newspapers to avoid anti-Japanese content is therefore a practical example of how appeasement policies from the British Government were implemented locally. This newspaper comparison is consistent with the findings in archival sources that I also researched for my dissertation (Schneider 2024) and lays the censored characters open for the first time.