Reading from a URLConnection in Java

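I am trying to read HTML from a URL connection. In one case, the HTML file I am trying to read has 5 line breaks before the actual doctype declaration. In that case, the input reader throws an EOFException.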

URL pageUrl = 
    new URL(
        "http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html"
    );

URLConnection getConn = pageUrl.openConnection();
getConn.connect();
DataInputStream dis = new DataInputStream(getConn.getInputStream());
//some read method here

Has anyone run into a problem like this?

URL pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html");
URLConnection getConn = pageUrl.openConnection();
getConn.connect();
DataInputStream dis = new DataInputStream(getConn.getInputStream());
String urlData = "";
while ((urlData = dis.readUTF()) != null)
    System.out.println(urlData);

// throws an exception

java.io.EOFException
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
at java.io.DataInputStream.readUTF(DataInputStream.java:572)
at java.io.DataInputStream.readUTF(DataInputStream.java:547)

With a BufferedReader, it just returns null and does not continue:

pageUrl = new URL("http://www.nytimes.com/2011/03/15/sports/basketball/15nbaround.html");
URLConnection getConn = pageUrl.openConnection();
getConn.connect();
BufferedReader br = new BufferedReader(new InputStreamReader(getConn.getInputStream()));
String urlData = "";
while (true) {
    urlData = br.readLine();
    System.out.println(urlData);
}

This prints null.

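You are using DataInputStream to read data that was not encoded with DataOutputStream. Check the documented behavior of DataInputStream#readUTF(): it first reads two bytes to form a 16-bit integer, which gives the number of following bytes that contain the UTF-encoded string. The data you are reading from the HTTP server is not encoded in this format.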
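To make the length-prefix point concrete, here is a minimal sketch that runs entirely in memory, with no network involved. The class name ReadUtfDemo and both helper methods are made up for illustration. It first round-trips a string through writeUTF and readUTF, then feeds readUTF plain text that starts with newlines, as in the question. In that second case the first two bytes (0x0A 0x0A) are taken as a length of 2570, which is more than the stream holds, so the read fails with EOFException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReadUtfDemo {

    // Hypothetical helper: writes a string with writeUTF and reads it back with readUTF.
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(text); // 2-byte length prefix, then the encoded string bytes
        }
        byte[] encoded = buffer.toByteArray();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
            return in.readUTF();
        }
    }

    // Hypothetical helper: returns true if readUTF hits EOFException on plain text bytes.
    static boolean failsOnPlainText(String text) throws IOException {
        byte[] raw = text.getBytes(StandardCharsets.US_ASCII);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            in.readUTF(); // misreads the first two text bytes as a length
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello"));
        System.out.println(failsOnPlainText("\n\n\n\n\n<!DOCTYPE html>"));
    }
}
```

The same stream class works in the first call only because the bytes were produced by DataOutputStream.writeUTF; an HTML page never is.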

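Instead, the HTTP server sends headers encoded in ASCII, per RFC 2616 sections 6.1 and 2.2. You need to read the headers as text, and then determine how the message body (the "entity") is encoded.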
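One possible way to act on this is sketched below. When you go through URLConnection, the headers are already parsed for you, so the remaining work is to pick a charset from the Content-Type value and read the body as text. The class name BodyReader and its two helpers are hypothetical, and the ISO-8859-1 fallback used in main is an assumption for responses that name no charset. The demo uses an in-memory stream so it runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class BodyReader {

    // Hypothetical helper: pulls the charset parameter out of a Content-Type value
    // such as "text/html; charset=UTF-8"; returns the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // malformed or unsupported charset name
                }
            }
        }
        return fallback;
    }

    // Hypothetical helper: reads the whole body as text, line by line, until
    // readLine() returns null (end of stream). Blank lines come back as "".
    static String readBody(InputStream body, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(body, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body with leading blank lines before the doctype.
        byte[] fakeBody = "\n\n\n\n\n<!DOCTYPE html>\n<html></html>\n".getBytes(StandardCharsets.UTF_8);
        Charset charset = charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1);
        System.out.print(readBody(new ByteArrayInputStream(fakeBody), charset));
    }
}
```

With a real connection, you would pass getConn.getInputStream() as the body and getConn.getContentType() as the header value. Note that the loop stops on the first null from readLine(); a while (true) loop keeps printing null after the stream has ended.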
