regex - How to use regular expressions in Java to remove certain characters
My general question is: how do I parse a string, eliminate the punctuation, and replace some of it?
I'm trying to modify an input text. In my case I have a normal text file with punctuation, and I want all of it eliminated. If the symbol is . ! ? or ... I want to replace it with a "<s>" string.
I have never used regex, so I tried string comparison, but it isn't sufficient for all cases. I have trouble when there are two punctuation marks together, as in the text "the second day (the 4th).", where ) and . appear together.
For example, given this input I expect the following:
Input: [...] @ it!" speech caused
Expected output: @ <s> speech caused
Every word in the code is added to an ArrayList because I need to work with them later.
Thanks a lot!
    // Requires java.io.FileInputStream, java.io.InputStreamReader and java.io.BufferedReader;
    // words is the ArrayList<String> that collects the tokens.
    FileInputStream fileInputStream = new FileInputStream("text.txt");
    InputStreamReader inputStreamReader = new InputStreamReader(fileInputStream, "UTF-8");
    BufferedReader bf = new BufferedReader(inputStreamReader);
    words.add("<s>");
    String s;
    while ((s = bf.readLine()) != null) {
        String[] var = s.split(" ");
        for (int i = 0; i < var.length; i++) {
            if (var[i].endsWith(",") || var[i].endsWith(")") || var[i].endsWith("(")
                    || var[i].endsWith(":") || var[i].endsWith(";") || var[i].endsWith("'")) {
                // strip one trailing punctuation character
                var[i] = var[i].substring(0, var[i].length() - 1);
                words.add(var[i].toLowerCase());
            } else if (var[i].startsWith("'")) {
                // strip a leading apostrophe
                var[i] = var[i].substring(1, var[i].length());
                words.add(var[i].toLowerCase());
            } else if (var[i].endsWith(".") || var[i].endsWith("...")
                    || var[i].endsWith("!") || var[i].endsWith("?")) {
                // sentence end: strip one character and add the marker
                var[i] = var[i].substring(0, var[i].length() - 1);
                words.add(var[i].toLowerCase());
                words.add("<s>");
            } else {
                words.add(var[i].toLowerCase());
                // System.out.println("\n newly read word: " + var[i]);
            }
        }
    }
First use a regex to filter out the punctuation, then split on spaces and add the result to the list:
    // Also requires java.util.Arrays.
    FileInputStream fileInputStream = new FileInputStream("text.txt");
    InputStreamReader inputStreamReader = new InputStreamReader(fileInputStream, "UTF-8");
    BufferedReader bf = new BufferedReader(inputStreamReader);
    words.add("<s>");
    String s;
    while ((s = bf.readLine()) != null) {
        s = s.replaceAll("[^a-zA-Z ]", ""); // replace everything except letters and spaces with the empty string
        String[] var = s.split(" ");
        words.addAll(Arrays.asList(var)); // addAll needs a Collection, not an array
    }
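Note that deleting every non-letter also deletes . ! and ?, so no "<s>" marker is produced at sentence ends. A minimal sketch of one way to keep the markers is given here; it is only an illustration, not the answerer's method. The class name SentenceTokenizer and the helper tokenize are hypothetical, and the sketch assumes that digits should be kept, that every other punctuation character (including apostrophes) may be treated as a word separator, and that a run such as "..." or ")." counts as a single sentence end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceTokenizer {
    // Hypothetical helper: turns one line of text into lower-case tokens,
    // with "<s>" in place of each run of sentence-ending punctuation.
    static List<String> tokenize(String line) {
        // Step 1: replace everything except letters, digits, whitespace
        // and the sentence enders . ! ? with a space.
        String cleaned = line.replaceAll("[^\\p{L}\\p{Nd}\\s.!?]", " ");
        // Step 2: replace each run of . ! ? (so "..." too) with the marker.
        String marked = cleaned.replaceAll("[.!?]+", " <s> ");

        List<String> tokens = new ArrayList<>();
        // Step 3: split on any amount of whitespace and lower-case the words.
        for (String token : marked.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [the, second, day, the, 4th, <s>, it, works, <s>]
        System.out.println(tokenize("The second day (the 4th). It works!"));
    }
}
```

Inside the reading loop this could be used as words.addAll(tokenize(s)). Replacing punctuation with a space rather than the empty string avoids gluing neighbouring words together ("day(the" would otherwise become "daythe"), at the cost of splitting contractions such as "don't" into two tokens.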